|Home | About | Journals | Submit | Contact Us | Français|
Edited by Dr Katsumi Isono
Buckwheat (Fagopyrum esculentum Moench; 2n = 2x = 16) is a nutritionally dense annual crop widely grown in temperate zones. To accelerate molecular breeding programmes of this important crop, we generated a draft assembly of the buckwheat genome using short reads obtained by next-generation sequencing (NGS), and constructed the Buckwheat Genome DataBase. After assembling short reads, we determined 387,594 scaffolds as the draft genome sequence (FES_r1.0). The total length of FES_r1.0 was 1,177,687,305 bp, and the N50 of the scaffolds was 25,109 bp. Gene prediction analysis revealed 286,768 coding sequences (CDSs; FES_r1.0_cds) including those related to transposable elements. The total length of FES_r1.0_cds was 212,917,911 bp, and the N50 was 1,101 bp. Of these, the functions of 35,816 CDSs excluding those for transposable elements were annotated by BLAST analysis. To demonstrate the utility of the database, we conducted several test analyses using BLAST and keyword searches. Furthermore, we used the draft genome as a reference sequence for NGS-based markers, and successfully identified novel candidate genes controlling heteromorphic self-incompatibility of buckwheat. The database and draft genome sequence provide a valuable resource that can be used in efforts to develop buckwheat cultivars with superior agronomic traits.
The genomes of model plants, such as Arabidopsis thaliana and Oryza sativa (rice), were fully sequenced by the start of the 21st century, and databases containing chromosomal pseudo-molecules and gene annotation information have subsequently been developed and are widely used as tools and resources for plant genomics and genetics studies. Recently, next-generation sequencing (NGS) has emerged as a powerful technique for analysing the genomes of non-model crops in which few molecular genetic studies have been performed. Genome sequences obtained by NGS can be used to construct databases that contain information of genes inferred from available information of genes in model plants. These genome databases in non-model crops will pave the way for the rapid identification of useful genes for crop breeding, which have already been identified in model plants. These databases will also facilitate the construction of fine genetic maps [based on single nucleotide polymorphism, simple sequence repeat (SSR), and NGS-based markers], and make it possible to identify agronomically important genes by map-based cloning. Thus, genome analyses in various non-model crops are underway, and the genomes of >50 non-model crops have already been sequenced (CoGepedia; https://genomevolution.org/). For example, NGS has been used to sequence the genomes of crops that produce beneficial secondary metabolites, such as flavonoid-producing Viburnum trilobum (American cranberry)1 and capsaicin-producing Capsicum annuum (hot pepper),2 and of crops that are tolerant to environmental stress, such as Setaria italica (foxtail millet)3 and Cajanus cajan (pigeonpea),4 which grow in semi-arid regions. NGS technology has opened the door to elucidating the molecular mechanisms that control agronomically important traits, and there is much interest in using this technology to analyse the genomes of non-model crops.
Buckwheat (Fagopyrum esculentum Moench; 2n = 2x = 16) is a widely cultivated annual crop in temperate zones. This nutritionally dense non-model crop contains high levels of starch, protein, flavonoids, and dietary fibre in the grain.5 Furthermore, buckwheat flour is gluten-free and can replace wheat flour in a coeliac diet.6 Buckwheat, however, has two major defects as a crop. First, its outcrossing nature, caused by heteromorphic self-incompatibility (SI), makes it difficult to produce pure cultivars of buckwheat and to fix useful traits. Second, buckwheat grains contain allergens, which induce anaphylactic reactions in some people.7 Improving the nutritional quality of the grain and removing genes responsible for SI and allergens are important breeding objectives in buckwheat, and various genetic molecular marker systems have been developed for this purpose [e.g. amplified fragment length polymorphism (AFLP) markers,8 SSR markers,9 expressed sequence tag (EST) markers,10 and array-based markers11]. However, AFLP markers have not yet been converted to single locus markers in the buckwheat genome. SSR markers have limited utility in buckwheat due to difficulty in amplifying specific loci because of the high level of genetic diversity between buckwheat cultivars, and EST marker systems do not span the entire genome. The newest genome map of buckwheat constructed using array-based markers has sufficient markers to cover the entire genome; however, it requires a specialized instrument to interpret the fluorescence signals of the arrays.11 Recently, a versatile NGS-based genotyping method with a low-cost, genotyping-by-sequencing (GBS) marker system was developed.12 The GBS system utilizes redundant libraries constructed with PCR fragments that have recognition sites of two kinds of restriction enzymes on both ends. The PCR fragments sequenced using NGS technology are mapped to reference sequences for genome-wide genotyping. The GBS system has been used to genotype various crop species to date.13 A draft genome of buckwheat could be used as a reference sequence for developing GBS markers to identify genes that control desirable breeding traits.
Here, we used NGS-based technology to sequence the buckwheat genome, and constructed the Buckwheat Genome DataBase (BGDB; http://buckwheat.kazusa.or.jp). This database can be used for the rapid detection of homologues of genes previously identified in other plants, and we present three examples of buckwheat genes identified using this approach, i.e. genes controlling flavonoid biosynthesis and genes encoding 2S albumin-type allergens and granule-bound starch synthases (GBSSs). Furthermore, to illustrate that the draft genome can be used as a reference sequence for NGS-based genotyping, we used GBS technology to identify novel candidate genes for controlling heteromorphic SI of buckwheat.
A single buckwheat plant with short-styled flowers, a descendant of material used in a previous study to construct a buckwheat BAC library,14 was obtained from sib-crossing (BC1F6). Nuclei were extracted from leaf tissues of the single plant as described previously.14 Subsequently DNA was extracted from the nuclei according to a previously described method.11 To construct a training set for gene prediction using Augustus 3.0.3,15 total RNA was prepared from the anthers of short-styled and long-styled plants, cv ‘KOTO’ using a previously described method.16
A paired-end (PE) library with insert sizes of 180–200 bp and a mate-pair (MP) library with insert sizes of 3, 5, 10, and 20 kb were constructed from nuclear DNA according to the manufacture's protocol (Illumina Inc., San Diego, CA, USA). A PE RNA-Seq library with insert sizes of ~275 bp was also constructed. Sequencing of genomic and RNA-Seq libraries using Illumina HiSeq 2000 was respectively carried out at Hokkaido System Science Co., Ltd and Beijing Genomics Institute. The PE and MP reads were subjected to quality trimming by PRINSEQ 0.20.4,17 and further to adaptor trimming by the fastx_clipper program in the FASTX-toolkit 0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit). The quality value threshold used for quality trimming was 10 from the 3′ terminal, and the adaptor sequence used was ‘AGATCGGAAGAGC’. Then, for the PE library with insert sizes of 180 bp, one base at the 3′ terminal was trimmed from all reads due to low quality, and PE reads shorter than 99 bp and including undetermined nucleotides (Ns) were excluded. For the MP library with insert sizes of 3, 5, and 10 kb, reads shorter than 49 bp and including Ns were excluded, and the 50 bp from the 5′ terminal were used for scaffolding. For the MP library with an insert size of 20 kb, reads shorter than 99 bp and including Ns were excluded, and the 50 bp from the 3′ terminal were used for scaffolding. For the PE RNA-Seq data, reads shorter than 89 bp and including Ns were excluded. The trimmed reads were used for further analyses.
For genome size estimation, we used PE reads with a k-mer size of 17, as successfully used in a previous study.18 The k-mer distribution was investigated using Jellyfish 126.96.36.199 The genome size and coverage (i.e. the number of base pairs sequenced as a multiple of the number of base pairs present in the genome) were estimated using the peak at 47 on the k-mer frequency distribution curve (Supplementary Fig. S1) according to a previously described method.18
The trimmed PE reads were assembled using SOAPdenovo2 rev24020 with k-mer sizes of 61, 71, 81, and 91 nt. The option used was –RF –M 1–K [k-mer size]. After the assembly, gaps in scaffolds were closed using GapCloser 1.10 (http://soap.genomics.org.cn/soapdenovo.html) (P= 31). The trimmed MP reads were used for scaffolding by SSPACE2.021 with parameters –k 5 –x 0 –g 3 –a 0.7. Sequences homologous to bacteria, fungi, and human (hg19) genome sequences, vector sequences from UniVec (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/), chloroplast (accession number: NC_000932.1), and mitochondrial (accession number: NC_001284.2) genome sequences from A. thaliana, and the PhiX sequence used in Illumina sequencing by BLASTN22 searches with an E-value cut-off of 1E−10 and length coverage of ≥10% were excluded as probable contamination. Finally, scaffolds longer than 300 bp were selected and designated FES_r1.0. Repetitive sequences in FES_r1.0 were detected using RepeatScout 1.0.523 and RepeatMasker 4.0.3 (http://www.repeatmasker.org) according to a previously described method.18
The RNA-Seq reads were mapped onto the draft genome sequence (FES_r1.0) with TopHat 188.8.131.52 The bam2hints program installed in Augustus 3.0.3 was used to generate the intronhints.gff file, and Cufflinks was used to reconstruct transcripts in an exonhints.gff file. The two gff files were merged to form a HINTS file. The HINTS file was used as the buckwheat training set to predict genes for Augustus 3.0.3 (Method 1). Furthermore, genes were predicted using Augustus 3.0.2 (Method 2) and geneid 1.4.425 (Method 3) with an A. thaliana training set. Finally, the genes predicted by the three methods were merged.
The merged genes were subjected to similarity searches against NCBI's NR database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz) and amino acid sequences of A. thaliana from TAIR10 (https://www.arabidopsis.org) using BLASTX with an E-value cut-off of 1E−10. The top hit was used to assign the product name. BLAST searches against UniProt (TrEMBL + Swiss-Prot) with an E-value cut-off of 1E−20 were also carried out. A domain search against InterPro (http://www.ebi.ac.uk/interpro/) was conducted using InterProScan26 with an E-value cut-off of 1.0. Finally, genes were classified based on NCBI's euKaryotic clusters of Orthologous Groups (KOG) database27 by performing BLAST searches with an E-value cut-off of 1E−4.
Genes related to transposable elements were inferred based on a BLAST search against the NCBI's NR database, and conserved domains were identified based on a search against InterPro and GyDB 2.028 using hmmsearch in HMMER 3.029 with an E-value cut-off of 1.0. Transfer RNA genes (tRNAs) were predicted using tRNAscan-SE v.1.23.30 Ribosomal RNA genes (rRNAs) were predicted in BLASTN searches with an E-value cut-off of 1E−10 using A. thaliana 5.8S and 25S rRNAs (accession number: X52320.1) and 18S rRNA (accession number: X16077.1) as queries.
The draft genome sequence (FES_r1.0), predicted gene sequences, deduced amino acid sequences, annotations derived from BLAST searches against the TAIR10 and NCBI's NR databases, and domains identified in the search against InterPro were included in the BGDB. In addition, local BLAST searches and keywords searches for gene names and their annotations were also implemented in the BGDB.
Total DNA extracted from 18 short-styled and 18 long-styled buckwheat landraces from around the world (Supplementary Table S1) was used for GBS analysis. GBS was carried out according to Elshire et al.,12 except that EcoRI and MseI were used as restriction enzymes. Barcode adaptors are listed in Supplementary Table S2. Barcode-labeled amplicons were sequenced by Illumina HiSeq 2000 at Hokkaido System Science Co., Ltd. The PE reads were subjected to quality and adaptor trimming by Trimmomatic 0.3.2.31 The quality value threshold used for quality trimming was 25 with a window size of 5, and the adaptor sequences used were ‘CACGACGCTCTTCCGATCT’ and ‘ACCGCTCTTCCGATCTGTAA’. Then, PE reads longer than 39 bp were aligned to reference sequences (FES_r1.0) using BWA 0.7.9,32 and the mapping results were processed with SAMtools 0.1.18.33 To minimize mismatching bases across all the reads, local realignment procedure was carried out using RealignerTargetCreator and IndelRealigner in GATK 3.4.34 All the sites on reference sequences that mapped with reads were extracted and combined in a variant call format file using the UnifiedGenotyper in GATK 3.4 with the option of -out_mode EMIT_ALL_CONFIDENT_SITES. Sites at which >50 reads were mapped in long-styled plants but not in short-styled plants were defined as ‘non-SS’. Likewise, sites at which >50 reads were mapped in short-styled plants but not in long-styled plants were defined as ‘non-LS’. Then, the number of short-styled plants with mapped reads at each non-LS site and the number of long-styled plants with mapped reads at each non-SS site were counted. Non-LS sites shared by >10 short-styled plants were regarded as S-allele-specific sites (see Results and discussion). Then, scaffolds harbouring >39 S-allele-specific sites were regarded as S-allelic scaffolds.
The k-mer frequency distribution curve (k-mer = 17) using PEs with 180 and 200 bp is shown in Supplementary Fig. S1. Based on this curve, the genome size of buckwheat was estimated to be between 1,212,021,130 and 2,424,042,260 bp using peaks at a multiplicity of 94 (coverage = 111.9) and 47 (coverage = 56.0), respectively. The genomic DNA used in this study is expected to contain heterozygous regions, due to the outcrossing nature of buckwheat; however, we used sib-mating descendant plants as material to reduce heterozygous genomic regions. The haploid genome size of 1.2 Gb calculated based on the major peak (multiplicity = 94) is almost the same as that estimated from cytometry analyses (1.34 Gb).39
The numbers of raw and trimmed reads are summarized in Supplementary Table S3. The trimmed reads with k-mer sizes of 61, 71, 81, and 91 nt were assembled using SOAPdenovo2. The N50 values of the assemblies using k-mer sizes of 61, 71, 81, and 91 nt were, respectively, 1,388, 1,419, 1,350, and 770 bp. The longest scaffolds, i.e. those assembled with a k-mer size of 71, were used for further analysis. Gaps in the contigs were closed using GapCloser 1.10 (http://soap.genomics.org.cn), and mate-pair reads were used for scaffolding in SSPACE2.0. The 2,693,661 scaffolds that were shorter than 299 bp and the 1,908 scaffolds that exhibited signs of contamination (identified in a BLAST search) were excluded, and the remaining 387,594 scaffolds were designated as the draft genome sequence, FES_r1.0 (Table 1). The total length of FES_r1.0 was 1,177,687,305 bp, and the N50 length was 25,109 bp. The scaffolds were named ‘Fes_sc’ followed by a six-digit identifier and the sequence version (e.g. Fes_sc000001.1).
Considering the genome size, the total length of the assembled genome sequence was close to the estimated size; therefore, the draft genome sequence (FES_r1.0) was considered as the haploid genome sequence. The draft genome sequence spanned 98.3% of the genome size estimated in the k-mer frequency distribution analysis, and 88.1% of that estimated by flow cytometry analysis.39 The high coverage rates, low N50 value (25,109 bp), and large total number of scaffolds (387,594) obtained in the present study may be due to the high proportion of heterozygous genomic regions present in the plant materials used, as indicated by the k-mer frequency distribution curve (Supplementary Fig. S1). Long-read data generated by PacBio RS was found to increase N50 and reduce the total number of scaffolds in the draft genome of Primula veris (cowslip), a heterozygous plant.40 A study of Raphanus sativus (radish)41 indicated that the lengths of scaffolds assembled based only on short-read data were drastically increased by constructing super-scaffolds with SSPACE2.0 using BAC-end sequences. We have already constructed a BAC library14 using parental plants of the material used in this study. Long reads generated by PacBio RS and BAC-end sequencing will be used to expand the scaffold length of the present draft genome of buckwheat.
Gene predictions were performed using Augustus 3.0.3 with the buckwheat training set (Method 1), Augustus 3.0.2 with the A. thaliana training set (Method 2), or geneid with the A. thaliana training set (Method 3), and the results obtained using the three methods (Methods 1–3) are summarized in Supplementary Table S4. If the genes were located at the same locus when using Methods 1–3, the longest gene was selected. After the results were integrated, the total length of the CDSs (FES_r1.0_cds) was 212,917,911 bp composed of 286,768 CDSs, and N50 was 1,101 bp. The gene name was prefixed with a six-digit identifier followed by the prediction method and scaffold number (i.e. auf: Augustus 3.0.3, buckwheat training set, Method 1; aua: Augustus 3.0.2, A. thaliana training set, Method 2; and gia, geneid 1.4.4, A. thaliana training set, Method 3), as in the following example: Fes_sc0012271.1.g000001.aua.1.
Genes related to transposable elements (TEs) were inferred according to BLAST searches against the NCBI's NR database (Supplementary Table S5). The total length of known repeats was 133,362,886 bp (11.32% of FES_r1.0, i.e. 1,177,687,305 bp) and Class I long terminal repeat (LTR) elements were frequently found (8.79% of FES_r1.0). We identified unique repeats that had not previously been sequenced in this analysis, and these had a total length of 475,367,120 bp and accounted for 40.36% of FES_1.0. Genes annotated as transposons were tagged ‘TE’ in the database.
Based on BLAST searches against the NR database, the genes were tagged as intrinsic (including both of a start codon and stop codon), partial (including a start codon or stop codon, or lacking both start and stop codons), pseudo (pseudogenes; including a stop codon in the coding region), and short (<49 amino acids). Based on BLAST searches against UniProt (TrEMBL + Swiss-Prot) with an E-value cut-off of 1E−20, the genes were tagged ‘f’ (hit region of ≥70% in query length), ‘p’ (hit region of <70% in query length), or ‘n’ (no hits against UniProt). The tags assigned to the genes are listed in Supplementary Table S6; 35,816 predicted genes tagged as full length or partial were annotated by BLAST searches against NR and/or UniProt. The predicted genes were classified based on NCBI's KOG database for F. esculentum, Beta vulgaris (which is classified in the same order, Caryophyllales, as F. esculentum), and A. thaliana (the most advanced model plant). KOGs for the predicted genes in the three species were assigned to the 25 functional categories, which were classified into four large groups (Groups 1–4; Supplementary Table S7), and the percentage of KOGs in each category was calculated for each plant species. Figure 1 shows the percentage of KOGs in Groups 1–3. Note that the number of KOGs assigned to the functional categories of N (cell motility) in Group 2 was <10 for all three species and was excluded in Fig. 1. The distribution of KOGs in each large group was similar among these species. For instance, KOGs from all three species were enriched in categories K, O, and G, and were poor in categories B, W, and H. In addition to the protein-coding genes, we identified 1,374 genes for tRNAs in total, and the numbers of genes for each tRNA are summarized in Supplementary Table S8.
Consequently, we were able to identify >35,000 annotated genes including genes for tRNA. Many of them were classified into similar functional categories as in the other two plant species. The BGDB constructed in this study on the basis of FES_r1.0 is expected to serve as a useful tool for identifying genes to develop buckwheat cultivars with improved agronomic properties. The BGDB is available from the Kazusa DNA Research Institute (http://buckwheat.kazusa.or.jp). To demonstrate the utility of this database, we then conducted four test analyses focusing on agronomically important genes.
Buckwheat contains several kinds of flavonoids, such as flavonols, proanthocyanidins, and anthocyanins. The flavonol rutin is present at high levels in buckwheat seeds42 and seems to be beneficial for human health. Several genes encoding enzymes related to flavonoid biosynthesis in buckwheat have been reported43,44 and are presumed to be regulated by transcription factors (TFs), such as MYB, bHLH, and WD40, as in other plant species.45–48 However, little is known about such TFs in buckwheat. The R2R3-MYB TFs are thought to play central roles in plant-specific processes, based on their specific gene expression patterns.49–51 To provide an overview of genes that regulate plant-specific processes, including flavonoid synthesis in buckwheat, we searched for candidate genes encoding R2R3-MYB TFs using the BGDB.
By conducting a keyword search using the term ‘MYB’, we identified 274 genes predicted to encode MYB TFs. From these, we excluded partial sequences, pseudogenes, and genes that did not contain fully conserved R2R3 regions. The remaining 71 putative R2R3-MYB TFs obtained from the database are listed in Supplementary Table S9. Phylogenetic analyses based on the R2R3 domain often reveals functionally characterized groups that are present in a wide range of plant species.52,53 In the present study, six putative R2R3-MYBs were assigned within known functional groups consisting of representatives from other plant species (Supplementary Fig. S2). Though functional analyses would need to be conducted to determine the role of each gene, this finding shows that R2R3-MYB genes, which likely have different roles, can successfully be obtained from the BGDB.
To initiate the transcription of genes encoding enzymes in the flavonoid biosynthetic pathway, a TF such as MYB or the MYB-bHLH-WD40 (MBW) complex must bind to TF binding sites (TFBSs) in the promoter region of each gene. Mutation of TFBSs alters the expression of genes.54,55 Therefore, MYB TFs as well as the TFBSs of target genes can be manipulated to improve flavonoid production. To identify promoter sequences in a non-model plant species, genome walking, which is time-consuming and expensive, would usually be performed. In this study, we tried to identify the promoter sequences of genes in the flavonoid biosynthetic pathway and estimated the TFBSs using PLACE (http://www.dna.affrc.go.jp/PLACE/). cDNA sequences of nine genes in the pathway have been already registered in GenBank. For three of these genes (chalcone isomerase, CHI; flavonoid 3′-hydroxylase, F3′H; anthocyanidin synthase, ANS), we successfully determined the 1,000–2,000 bp upstream region after a BLASTN search against the BGDB. DNA motifs relating to the MYB or MBW complex predicted by PLACE analysis are shown in Supplementary Fig. S3. These results can be confirmed by molecular techniques such as gel-shift assays, as reported for other plant species (e.g. apple,56 persimmon,53 and soybean57). The BGDB is thus a powerful tool for isolating promoter sequences and accelerating molecular-based analyses of TFs. In this study, however, we could not identify sufficiently long promoter regions (over 1,000 bp) for the remaining six genes, mainly because of gaps between contigs. Gap closing using long reads generated by a PacBio sequencer will greatly improve the ability to search for the promoter regions of target genes.
Buckwheat seeds contain allergens. For instance, Fag e 2 (16 kDa protein) is a pepsin-resistant 2S albumin that causes an immediate allergic reaction.58 Although Fag e 2 cDNA has been sequenced,58 no further genomic information is available. Efforts to develop hypoallergenic buckwheat and establish inspection techniques to minimize allergen contamination in food products require detailed genomic information of Fag e 2.
A BLAST search of Fag e 2 (accession number: DQ304682) among the predicted proteins in the BGDB yielded one identical gene (Fes_sc0000087.1.g000014.aua.1) and four homologues (Fes_sc0000087.1.g000011.aua.1, Fes_sc0000087.1.g000013.aua.1, Fes_sc0000087.1.g000028.aua.1, and Fes_sc0007211.1.g000003.aua.1) (Table 2). The results of a BLAST search against the NCBI's NR database indicated high similarities of four homologues with previously reported allergens of buckwheat. As shown in Fig. 2, the predicted amino acid sequences of Fes_sc0000087.1.g000011.aua.1 and Fes_sc0000087.1.g000028.aua.1 showed high levels of similarity (97 and 98%, respectively) with the buckwheat 8 kDa allergen, which is a member of the 2S-albumin multi-gene family.59 In contrast, the predicted proteins of Fes_sc0007211.1.g000003.aua.1 and Fes_sc0000087.1.g000013.aua.1 did not have high levels of amino acid sequence similarity (43 and 79%, respectively) with known buckwheat allergens. Fes_sc0007211.1.g000003.aua.1 had a 55-amino acid deletion at the N terminal and lacked four of eight characteristic Cys residues present in Fag e 2 and 2S albumin family.60 Thus, the gene product is not likely to be allergenic. On the other hand, Fes_sc0000087.1.g000013.aua.1, which shows similarity with Fag e 2, might be a novel allergen, because its predicted protein retained the eight characteristic Cys residues. Further immunoblotting analysis will clarify whether or not the protein encoded by Fes_sc0000087.1.g000013.aua.1 is allergenic. It is notable that four genes that retain conserved Cys residues are estimated to be located within a genomic region of 108 kb on single scaffold (Fes_sc0000087.1). Particularly, the Fag e 2 gene (Fes_sc0000087.1.g000014.aua.1), and the Fes_sc0000087.1.g000011.aua.1 and Fes_sc0000087.1.g000013.aua.1 homologues, which have similarities with buckwheat allergens, are located within a 17 kb region of the scaffold. Therefore, this region would be a prime candidate for silencing in studies aimed at producing hypoallergenic buckwheat.
Starch, which is present in many plant seeds, legumes, and tuber crops, is an important part of the human diet and has industrial applications. Starch contains two types of glucose polymer, i.e. amylopectin and amylose.61 Amylopectin is the major component of starch (60–90%),62 whereas the amylose content varies among plant species.63 Since the amylose content affects the properties of starch, modulating the amylose content has been an important breeding objective in crops.64 GBSS catalyses amylose synthesis.65 Thus, we searched for genes encoding starch synthases (SSs) from the BGDB, and aimed to identify GBSS genes using phylogenetic approaches.
To identify buckwheat GBSS genes, we conducted a keyword search using ‘starch synthase’ and obtained 42 hits. We then filtered these hits using modified five conserved sequence motifs proposed by Cao et al.66 [i.e. P(2)K(1)GGL(1)D(4)L, VS(5)E, G(2)NG(7)P(2)D, R(3)QKG, D(5)S(2)EPC(1)L(1)Q(5)YG(8)GGL (numbers in parentheses represent numbers of amino acids)]. Of the eight resulting sequences, one was excluded, as it was annotated as a pseudogene. The seven remaining sequences were each derived from a different scaffold.
To assign these seven putative SSs of buckwheat to previously proposed phylogenetic groups,67 we performed NJ analysis using the deduced amino acid sequences of SSs from various plant species. A keyword search using ‘starch synthase’ was also conducted in Phytozome 10.3 (http://phytozome.jgi.doe.gov/) analysing 36 angiosperm species (Supplementary Table S10). Of the 27,852 sequences identified, 238 sequences remained after the same filtering procedure as mentioned earlier and were subjected to phylogenetic analysis. Two GBSS sequences of Fagopyrum species, one from F. esculentum deposited at the EMBL/GENBANK/DDBJ (accession number: HW041459) and the other from F. tataricum (AHA36967.1),68 were also included in the analyses. A NJ-tree based on the alignment was suggested to contain the following five known phylogenetic groups: SSI, SSII, SSIII, SSIV, and GBSS (Supplementary Fig. S4). The SS sequences obtained from the BGDB belonged to four of the five classes. This suggests that the BGDB can be used to identify agronomically important genes. However, the previously deposited GBSS (HW041459) sequence is not identical to any of the seven sequences identified in the BGDB, and a BLASTN search using the coding region of HW041459 as a query detected only partially identical sequences over 360-bp and three scaffolds (Fes_sc0195744.1, Fes_sc0059460.1, and Fes_sc0005470.1). This is a shortcoming of the short scaffold size of the assemblies in the BGDB.
The GBSS clade contained four phylogenetically distinguishable sequences in total: HW041459 and three from BGDB. To clarify the detailed phylogenetic relationship among these four GBSS genes, we performed NJ analysis based on 71 aligned amino acid sequences of GBSS including those from Physicomitrella patens as outgroup (Fig. 3). The copy number of GBSS genes varies in plants; two diverged groups exist in the rosids69 and several copies of GBSS in buckwheat seem to also have diverged, at least in two lineages. In the cladogram, a GBSS sequence, Fes sc0004292.1.g000004, clustered with a GBSS sequence from F. tataricum, which belongs to the same genus. GBSS of F. tataricum was confirmed to be expressed in the endosperm,68 thus Fes sc0004292.1.g000004 is likely to be expressed in the endosperm too. If more than one GBSS is active within endosperms, the amylose content of buckwheat flour can be controlled by altering their copy number. In hexaploid wheat, Yamamori and Quynh70 evaluated the dosage effects of three GBSS genes. Loss of function of these genes is expected to differentiate the starch properties of the grain; moreover, distinct proportions of amylose might be produced according to copy number of active GBSS genes. Studies are underway to examine the expression of the three buckwheat GBSS genes identified here, and also the previously identified one (HW041459).
Finally, we screened for candidate genes that control heteromorphic SI in buckwheat. Buckwheat is a heteromorphic self-incompatible crop with dimorphic flowers (i.e. short-styled and long-styled flowers). Short-styled flowers have short styles and long stamens, whereas long-styled flowers have long styles and short stamens. The SI response is expressed between plants bearing the same flower morph, but not between plants bearing different flower morphs. Flower morph and SI response are determined by a diallelic system at the SELF-INCOMPATIBILITY supergene complex locus (S locus); S/s heterozygotes and s/s recessive homozygotes bear short-styled and long-styled flowers, respectively.71,72 Recently, S-LOCUS EARLY FLOWERING 3 (S-ELF3), which controls the short-styled phenotype of buckwheat, was isolated.16 Furthermore, it was suggested that recombination is strongly suppressed around S-ELF3.16 Based on these findings, we predicted that S-allelic scaffolds exist in which heteromorphic SI-related genes other than S-ELF 3 are located. Thus, we tried to detect S-allelic scaffolds in the draft genome and to identify novel candidate genes involved in the SI response in these.
To obtain S-allele-linked scaffolds from the draft genome, we used GBS reads obtained from each of 18 short-styled and 18 long-styled landraces of buckwheat originating from various countries (Supplementary Table S1). Briefly, GBS reads from 36 plants were mapped to scaffolds of >1,000 bp, and we subsequently extracted the non-LS sites (i.e. sites not present in all the long-styled plants) in which no reads from all of the 18 long-styled plants were mapped. The number of mapped reads for each plant ranged from 1,620,260 to 2,821,338 (Supplementary Table S11). The number of mapped reads per long-styled plant was not significantly different from that per short-styled plant (P-value is 0.536, t-test). We determined how many short-styled plants share each non-LS site, and the number ranged from 1 to 18 (Fig. 4). As a control, non-SS sites (i.e. sites not present in short-styled plants) were also counted as above, and the number of long-styled plants sharing non-SS sites ranged from 1 to 10 (Fig. 4).
As shown in Fig. 4, a striking U-shaped distribution was obtained when the number of non-LS sites was plotted against the number of short-styled plants sharing the non-LS sites. In contrast, non-SS sites shared by >10 long-styled plants were not detected at all. Considering that there was no significant difference in the number of mapped reads obtained from short-styled and long-styled plants, we regarded the non-LS sites shared by >10 short-styled plants as ‘S-allele-specific sites’. In total, 88,031 of the S-allele-specific sites were detected and found to be located on the S-allelic region, consisting of 332 of scaffolds encompassing 5,393,196 bp (Supplementary Table S12). Of the 332 scaffolds, Fes_sc0003500.1 and Fes_sc0015090.1 harbour S-ELF3 and SSG2, respectively. Since we previously established that S-ELF3 and SSG2 existed only in the genomes of short-styled plants,16 this result shows the effectiveness of our procedure in discovering the S-allelic region in the buckwheat genome. However, it should be noted that we used only one restriction enzyme combination (EcoRI and MseI) to obtain the GBS reads; we did not detect S-allelic scaffolds harbouring no recognition sites for these enzymes. Thus, the total length of the S-allelic region in the buckwheat genome obtained in the present study might be an underestimation.
In the S-allelic region, repeat sequences were abundant; the ratio of repeat length to the total length of the S-region was 71.43%, which is 1.4-times higher than that of the ratio of repeat length to the total scaffold length (51.69%, Supplementary Table S5). Gypsy elements were particularly abundant in the S-region; the Gypsy elements accounted for 12.15% of the S-region, which is 1.9 times higher than that of the repeat length in the total scaffold length (6.41%, Supplementary Table S5). Excluding TEs, only 32 predicted genes were successfully annotated by our database analyses in the 332 scaffolds (Supplementary Tables S12 and S13). Among the 32 predicted genes, two were candidates for heteromorphic SI related genes; Fes_sc0024869.1.g000002 and Fes_sc0006594.1.g000005 were predicted to encode proteins with similarity to a RING/U-box superfamily protein and an exoribonuclease 4, respectively. The RING/U-box protein is an E3 type ubiquitin ligase that functions in the 26S proteasome.73 It has been suggested that degradation of cytotoxic S-RNases by the 26S proteasome in pollen tubes occurs during compatible pollination in a self-incompatible species, Petunia inflata.74 Ongoing studies currently evaluating the expression pattern of these two genes and the effect of mutations in these two genes will clarify their roles in heteromorphic SI of buckwheat.
The genome size of buckwheat is relatively large (~1.2 Gb), and some genomic regions are expected to be in the heterozygous state due to its outcrossing nature. These factors would reduce the lengths of our scaffolds in the buckwheat draft genome, which was assembled using only Illumina short reads. Though the draft genome was trancated and divided into a large number of scaffolds (387,594 scaffolds), we have successfully identified genes that control agronomically important traits using gene predictions and subsequent annotations in the BGDB. Furthermore, we also identified novel candidate genes involved in heteromorphic SI of buckwheat using the draft genome as a reference sequence for GBS mapping. Even if the scaffolds in a draft genome are truncated, they can nonetheless be used for database construction and as a reference sequence for NGS-based genetic markers. We are now preparing induced mutant pools of buckwheat using heavy-ion beams and chemicals such as ethyl methanesulfonate. Using NGS-based multi-dimensional screening,75 mutants of the genes identified in this study will be rapidly identified from the pool and will be used to develop superior varieties of buckwheat.
The Illumina reads used in this study are available from DDBJ/EMBL/NCBI under the accession numbers listed in Supplementary Table S3. The DRA accession number of the Illumina reads used in GBS analysis is DRA004489. The scaffold sequences are available under the accession numbers BCYN01000001-BCYN01387594 (387,594 entries). The draft genome sequence FES_r1.0, CDS and protein sequences, and annotation file (gff file) are also available from the Buckwheat Genome DataBase (BGDB; http://buckwheat.kazusa.or.jp).
This study was supported by a grant from the Ministry of Agriculture, Forestry, and Fisheries of Japan (Genomics-based Technology for Agricultural Improvement, SFC3001) and JSPS KAKENHI (grant numbers 22580003, 25292009, 25450011, and 25660006).
We thank S. Tabata, T. Ota, K. Isono, and the anonymous reviewer for valuable suggestions and comments; R. Kajitani for kind information on genome assemblers; and K. L. Farquharson for language-editing support of the manuscript. We also thank Y. Nakajima for assistance in database construction and management. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.