More than 70 years after the first ex situ genebanks were established, major efforts in this field are still concerned with completing individual collections and securing their storage. Attempts to valorize ex situ collections for plant breeders have been hampered by the limited availability of phenotypic and genotypic information. With the advent of molecular marker technologies, first efforts were made to fingerprint genebank accessions, albeit on a very small scale and mostly based on inadequate DNA marker systems. Advances in DNA sequencing technology and the development of high-throughput systems for multiparallel interrogation of thousands of single nucleotide polymorphisms (SNPs) now provide a suite of technological platforms that facilitate the analysis of several hundred gigabases per day using state-of-the-art sequencing technology or the simultaneous analysis of thousands of SNPs. The present review summarizes recent developments regarding the deployment of these technologies for the analysis of plant genetic resources, in order to identify patterns of genetic diversity, map quantitative traits and mine novel alleles from the vast amount of genetic resources maintained in genebanks around the world. It also addresses the various shortcomings and bottlenecks that need to be overcome to leverage the full potential of high-throughput DNA analysis for the targeted utilization of plant genetic resources.
Plant breeding needs to focus on traits with the greatest potential to increase yield under changing climate conditions. Agricultural practices have gradually displaced local traditional varieties and crop wild relatives, leading to a dramatic loss of indigenous biodiversity. Tapping into the rich genetic diversity inherent in crop species and their wild relatives is a prerequisite for future germplasm improvement [2–7; http://www.fao.org]. Hence, new technologies must be developed to accelerate breeding by improving genotyping and phenotyping methods and by accessing the available genetic diversity stored in genebanks around the world.
Prior to the advent of molecular characterization, accessions in germplasm collections were mainly examined based on morphological characters and phenotypic traits. The development of molecular techniques now allows a more accurate analysis of large collections. High-throughput (HT) technologies including DNA isolation, genotyping, phenotyping and next-generation sequencing (NGS) provide new tools to add substantial value to genebank collections. The integration of genomic data into genebank documentation systems and its combination with taxonomic, phenotypic and ecological data will usher in a new era for the valorization of plant genetic resources (PGR). From the determination of phenotypic traits to the application of NGS to whole genomes, every aspect of genomics will have a great impact not only on PGR conservation, but also on their utilization in plant breeding.
Identification and tracking of genetic variation has become so efficient and precise that thousands of candidate genes can be tracked within large genebank collections. Using NGS technologies, it is possible to resequence candidate genes, entire transcriptomes or entire plant genomes more efficiently and economically than ever before. Advances in sequencing technology will allow for whole-genome resequencing of hundreds of individuals. In this way, information on thousands of candidate genes and candidate regions can be harnessed for thousands of individuals to sample genetic diversity within and between germplasm pools, to map Quantitative Trait Loci (QTLs), to identify individual genes and to determine their functional diversity. In this review, we outline some important developments in this field, where NGS technologies are expected to enhance the value and thus the usefulness of genebank collections.
PGR include cultivars, landraces, crop wild relatives and mutants. The loss of genetic diversity in many crop plants prompted efforts to collect PGR, which were initiated by Vavilov early in the 20th century and aimed at supplying plant breeders with genetic material to extend genetic variability as a basis for creating new crop varieties. A wealth of germplasm collections is available worldwide, with more than 7 million accessions held in over 1,700 genebanks (http://www.fao.org/docrep/013/i1500e/i1500e00.htm). These collections do not cover all crop species evenly but are strongly biased towards species of agricultural importance. About 50% of the global ex situ germplasm is made up of only 10 crop species, with the three largest collections (wheat, rice and barley) representing 28% of the global germplasm (Figure 1). Passport and genotypic data suggest that collections include varying degrees of duplication, resulting in ~1.9–2.2 million distinct accessions, with the remainder being duplicates (http://www.fao.org/docrep/013/i1500e/i1500e00.htm). Proper conservation of PGR, along with the development of best genebank practices and the promotion of their effective use, is vital for future food security. However, ex situ conservation is rather fragmented, largely because it is mainly based on national programs and scattered institutional efforts. For instance, barley (Hordeum vulgare L.) is maintained in more than 200 collections worldwide, amounting to approximately 470000 accessions. Other crop species follow similar patterns. Despite manifold efforts to coordinate genebank activities, conservation is still inefficient in many places and suffers from variable or even lacking standards, unreliable access, and poor characterization and documentation of the material. Ex situ germplasm collections for crop wild relatives are rather limited in size owing to the difficulties in maintaining non-domesticated plants.
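The duplication figures quoted above imply a rough redundancy rate. The following is a minimal sketch of that arithmetic; rounding the total to 7.0 million accessions is an assumption (the source states "more than 7 million"), so the result is only indicative.

```python
# Rough arithmetic on ex situ redundancy, using the figures quoted above.
# Assumption: total holdings are rounded to 7.0 million accessions.
TOTAL_ACCESSIONS = 7.0e6
DISTINCT_LOW, DISTINCT_HIGH = 1.9e6, 2.2e6


def duplicate_share(total: float, distinct: float) -> float:
    """Fraction of accessions that duplicate another accession."""
    return (total - distinct) / total


# More distinct accessions means fewer duplicates, hence the swapped bounds.
share_min = duplicate_share(TOTAL_ACCESSIONS, DISTINCT_HIGH)
share_max = duplicate_share(TOTAL_ACCESSIONS, DISTINCT_LOW)
print(f"duplicates: {share_min:.0%} to {share_max:.0%} of holdings")
```

Under this assumption, roughly 69% to 73% of all holdings would be duplicates.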
Introgression from wild to cultivated germplasm and vice versa, both during seed multiplication in genebanks and in the wild, poses a problem for proper maintenance and correct classification of the material, which is usually based on only a few morphological characters. Another problem is that genebank accessions, even if they represent inbreeding crop species, are often genetically heterogeneous and may show residual heterozygosity. While this may reflect the original genetic state, e.g. of a landrace accession, it can seriously impair molecular characterization of the accession and its subsequent use for research and breeding. Thus, most core collections are made up of accessions that have been purified by single seed descent (SSD).
Systematic phenotypic analysis of genebank collections is a time- and resource-intensive effort that has been mainly restricted to agronomic traits that show a high heritability and can be assessed based on the per se performance of an accession. Therefore, most evaluation efforts have focused on traits such as disease resistance and important morphological characters (yield components) [8, 17]. Deep genetic and phenotypic characterization of genetic resources by HT techniques, including resequencing of enriched candidate genes and low-coverage full-genome resequencing, will increasingly become available. Concomitantly, large amounts of data need to be integrated into the current documentation systems. Genebanks have to prepare for the genomics era by developing new strategies and novel information tools to assess the genetic diversity represented in their collections. Although there have been some successful examples of extracting useful genes from genebanks, the vast potential of this resource remains largely untapped [18, 19].
Numerous studies have examined the diversity, domestication, evolution and phylogeny of PGR, largely selected from genebank collections. Early studies considered morphological and cytogenetic characters. Various other techniques and molecular markers have been applied subsequently [20–23]. Until recently, amplified fragment length polymorphism (AFLP) or simple sequence repeats (SSR) were the molecular markers of choice for DNA fingerprinting of crop genomes [24–26]. Owing to their amenability to systematic development and HT detection, SNP markers are increasingly applied to study genetic diversity in germplasm collections of up to several hundred accessions. Many of these collections have been established as association panels for linkage disequilibrium (LD) mapping, thus providing a first link between phenotypic and genotypic data sets. The corresponding accessions have been selected from various germplasm sources or breeding programs to represent a rough cross section of the overall genetic diversity available for a given species or for an ecogeographical region [27, 28]. This is exemplified by a population comprising 224 spring barley accessions, which were selected from the Barley Core Collection (BCC) and complemented by additional accessions to cover the entire distribution range of this crop. More recently, about 1500 spring barley landraces adapted to temperate climate conditions were selected from among 22093 Hordeum accessions of the Federal ex situ genebank (IPK Gatersleben, Germany), based on their origin and morphology. The whole set has been genotyped with 43 SSR markers and analyzed for its genetic structure. While this is intended to usher in large-scale fingerprinting analysis of barley genebank accessions, the approach still falls short of providing informed molecular access to the entire collection.
Different marker systems for genetic diversity studies and population parameters can be compared across a collection, as recently shown in a comparison of the performance of 42 SSR markers and 1536 SNP markers. The marker type of choice and the number of markers to be studied have to be adjusted for each species and project.
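One reason why marker number has to be adjusted to marker type is that a bi-allelic SNP carries less information per locus than a multi-allelic SSR. As a minimal sketch, the expected heterozygosity (gene diversity, H = 1 − Σ pᵢ²) of a single locus can be computed as follows; the allele calls are invented for illustration and do not come from the studies cited above.

```python
from collections import Counter


def gene_diversity(alleles):
    """Expected heterozygosity H = 1 - sum(p_i^2) for one locus."""
    if not alleles:
        raise ValueError("at least one allele call is required")
    counts = Counter(alleles)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical allele calls across eight accessions.
snp_calls = ["A", "G", "A", "G", "A", "G", "A", "G"]          # two alleles
ssr_calls = [152, 154, 156, 158, 152, 154, 156, 158]          # four alleles

h_snp = gene_diversity(snp_calls)
h_ssr = gene_diversity(ssr_calls)
```

A bi-allelic locus cannot exceed H = 0.5, whereas a locus with four equally frequent alleles reaches 0.75, which is why larger numbers of SNPs are typically used to match the resolution of a smaller SSR set.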
Plant accessions from wild or locally adapted landrace genepools conserved in genebanks contain a rich repertoire of alleles that have been left behind by the selective processes of domestication, selection and cross-breeding that paved the way to today's elite cultivars. These resources remain underexplored owing to a lack of efficient strategies to screen, isolate and transfer important alleles. The most effective strategy for determining allelic richness at a given locus is currently to determine its DNA sequence in a representative collection of individuals. Large-scale allele mining projects at the molecular level, such as the one described for Pm3 in wheat, are needed for germplasm collections. Bhullar et al. first selected a set of 1320 bread wheat landraces from a virtual collection of 16089 accessions using the focused identification of germplasm strategy (FIGS), and then isolated seven new resistance alleles of the powdery mildew resistance gene Pm3. Similarly, a series of novel alleles has been detected for a recessive gene conferring virus resistance in barley [32, 33]. Further resequencing studies of candidate genes for agriculturally important traits have been published, albeit from smaller collections and mostly without functional characterization [34–40].
Resequencing of candidate genes using Sanger sequencing has been applied to study phylogenetic relationships of crop plants, their domestication, evolution, speciation and ecological adaptation. Early studies resequenced a single locus or few loci in only few individuals per species [41, 42]. Reduced costs for Sanger sequencing using capillary instruments and 96-well formats facilitated multilocus studies in larger collections [43–51].
Large-scale NGS is now possible using platforms such as Illumina/GA, Roche/GS FLX, Applied Biosystems/SOLiD and cPAL sequencing [52, 53]. The declining cost of generating such data is transforming all fields of genetics. Many crop plant genomes are characterized by a vast abundance of repetitive DNA. For example, the genome of barley comprises >5 Gb of DNA sequence, of which <2% can be accounted for by genes. Therefore, to avoid excessive sequencing of putatively non-informative, repetitive DNA, reduced-representation sequencing techniques have been developed to home in on a subset of the genome for sequencing [56, 57]. When combined with techniques for labeling reads (barcoding), DNA from many individuals can be analyzed in the same pooled sequencing reaction, and NGS thus provides an increasingly affordable means of doing so. These technologies are therefore becoming a standard choice for generating genetic data in fields such as population genetics, conservation genetics and molecular ecology. On the other hand, the deluge of sequence data they produce will make it necessary to develop an appropriate IT infrastructure and new computational solutions [58–64].
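The barcoding step mentioned above can be illustrated with a small sketch of how pooled reads are sorted back to their source accessions (demultiplexing). This is a simplified illustration, not the procedure of any particular platform: the barcode sequences, accession names and the exact-match rule are all assumptions, and real pipelines usually tolerate sequencing errors in the barcode.

```python
from collections import defaultdict

# Hypothetical barcode-to-accession table; real barcodes are platform specific.
BARCODES = {"ACGT": "accession_A", "TGCA": "accession_B"}


def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode and trim the barcode.

    Assumes all barcodes share one length and requires an exact match.
    Reads with an unknown prefix are kept untrimmed under 'unassigned'.
    """
    length = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:length], "unassigned")
        bins[sample].append(read if sample == "unassigned" else read[length:])
    return dict(bins)


pooled = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAATTT", "NNNNGGGCCC"]
by_sample = demultiplex(pooled, BARCODES)
```

Each accession then receives only its own, barcode-trimmed reads, so that many individuals can share a single sequencing run.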
Sequencing many individuals at low depth is another attractive strategy, e.g. for complex trait association studies. While detailed analysis of a single individual typically requires deep sequencing, resequencing of many individuals allows a drastic reduction of sequencing depth when combined with efficient genotype imputation to compensate for missing data. Genotype imputation has been used widely in the analysis of genome-wide association studies (GWAS) to boost power and to facilitate the combination of results across different studies using meta-analyses [66, 67].
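Why low-depth data depend on imputation can be made concrete with a back-of-the-envelope calculation. The sketch below assumes an idealized Poisson model of read placement, which is not part of the source and ignores coverage biases; under that assumption, the probability that a site receives no read at mean depth c is e^(−c).

```python
import math


def fraction_uncovered(mean_depth: float) -> float:
    """P(site has zero reads) under a Poisson model with the given mean depth."""
    return math.exp(-mean_depth)


def fraction_with_at_least(mean_depth: float, k: int) -> float:
    """P(site has at least k reads) under the same Poisson model."""
    below = sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Illustrative depths only; print the share of sites without any read.
for depth in (0.5, 1.0, 4.0):
    print(f"depth {depth}x: {fraction_uncovered(depth):.1%} of sites uncovered")
```

At 1x mean depth, more than a third of all sites are expected to be missing in any one individual, which is the gap that imputation from other individuals is meant to fill.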
We have not yet reached the point at which routine whole-genome resequencing of large numbers of crop plant genomes becomes feasible. Therefore, it is necessary to select genomic regions of interest and to enrich these regions before sequencing. Sequencing targeted regions of DNA (e.g. the exome or parts thereof) rather than complete genomes will likely be the preferred approach for most genomics applications, including evolutionary biology, association mapping and biodiversity conservation. Sequencing targeted regions on massively parallel sequencing instruments requires methods for concomitant enrichment of the templates to be sequenced. Several enrichment approaches are available, each with advantages and disadvantages [69–72]. Resequencing allows fingerprinting of many individuals without the ascertainment bias that is inherent to some SNP marker systems [73–75].
As outlined above, targeted resequencing of hundreds of loci in genebank collections is already feasible. Yet, the costs for DNA extraction, complexity reduction and barcoding need to be brought down to allow systematic resequencing of genebank collections. In this context, large efforts have recently been made to automate protocols for massively parallel (re)sequencing and data analysis in order to match the increasing instrument throughput. These protocols, which include e.g. large-scale automated library preparation and size selection on robots or fully automated construction of barcoded libraries, might help pave the way for automated NGS screening of genebank collections.
Triggered by advancements in sequencing technologies, several crop genome sequences have been produced or are underway [79–82]. Once good quality levels have been achieved, these sequences will enable researchers to address all kinds of biological questions or to link sequence diversity accurately to phenotypes.
Rapid developments in NGS will soon make whole-genome resequencing of several individuals or targeted resequencing of large germplasm collections a reality. This will help to eliminate an important difficulty in estimating LD and genetic relationships between accessions from bi-allelic genotyping studies, namely ascertainment bias, e.g. with respect to rare alleles [73, 83–85].
Based on the available Arabidopsis thaliana (L.) Heynh. genome sequence, Weigel and Mott advocated a 1001 Genomes project for Arabidopsis. Several Arabidopsis lines have been sequenced since [87, 88]. First studies on whole-genome resequencing in crop species have been published for rice and maize [66, 89, 90].
Combined genetic approaches have been performed for species for which a complete genome sequence and millions of SNPs are available. Such approaches, which include e.g. large-scale genotyping, targeted genomic enrichment, whole-genome resequencing and GWAS, have been used to identify allelic diversity, rare genetic variation and QTL and to characterize them functionally [91–96], or to identify selective sweeps of favorable alleles and candidate mutations that have had a prominent role in domestication.
SNPs are the most abundant form of genetic variation in eukaryotic genomes and are no longer a limiting factor, even for crop species with large genomes such as barley. SNP markers are rapidly replacing SSR and Diversity Arrays Technology (DArT) markers because they are more abundant, reproducible, amenable to automation and increasingly cost-effective [100, 101]. SNP-based resources are presently being developed and made publicly available for broad application in crop research.
A high-quality genomic sequence, as is available for Arabidopsis and rice, represents the ideal blueprint for resequencing and the identification of SNPs. But even for species with less complete genomic sequences, such as barley and wheat [103, 104] or other species [105–109], NGS methods are valuable for genome-wide marker development, genotyping and targeted sequencing across the genomes of populations [110–112]. These new methods include e.g. reduced-representation libraries (RRLs) [113–115], complexity reduction of polymorphic sequences (CRoPS) [116, 117], restriction-site-associated DNA sequencing (RAD-seq) and low-coverage sequencing for genotyping [119–121]. They are applicable to genetic analysis in non-model species, in species with high levels of repetitive DNA or in breeding germplasm with low levels of polymorphism, without the need for prior sequence information. These methods can be applied to compare SNP diversity within and between closely related plant species or within wild natural populations [122, 123].
The systematic characterization and utilization of naturally occurring genetic variation has become an important approach in plant genome research and plant breeding. So far, linkage mapping based on bi-parental progenies has proven useful in detecting major genes and QTLs [124, 125]. Although this approach has been successful in many analyses, it suffers from several drawbacks. LD or association mapping is an attractive alternative with several advantages over classical linkage mapping, notably the use of unstructured populations that have been subjected to many recombination events [126–128]. GWAS in diverse germplasm collections offer new perspectives for gene and allele discovery for traits of agricultural importance and for dissecting the genetic basis of complex quantitative traits in plants [129, 130]. However, GWAS require a genome-wide assessment of genetic diversity (preferably based on a reference genome sequence and resequenced parts thereof), of patterns of population structure, and of the decay of LD. For this, effective genotyping techniques for plants, high-density marker maps, phenotyping resources and, if possible, a high-quality reference genome sequence are required. In many cases, the results of GWAS need to be confirmed by linkage analysis.
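The decay of LD referred to above is commonly quantified by the squared allelic correlation r² between pairs of loci. The following is a minimal sketch of that statistic for two bi-allelic loci, assuming phased haplotypes or fully homozygous inbred lines coded as 0/1; the example data are invented.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation r^2 between two bi-allelic loci.

    hap_a and hap_b are equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (or per homozygous inbred line).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical allele calls across four inbred lines.
complete_ld = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
no_ld = ld_r2([0, 1, 0, 1], [0, 0, 1, 1])
```

Plotting r² against the genetic or physical distance between marker pairs gives the LD decay curve, which in turn indicates how dense a marker set has to be for a given population.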
GWAS have identified a large number of SNPs associated with disease phenotypes in humans, including in diverse worldwide populations. Early association mapping studies in crop plants were hampered by the limited number of mapped markers available and were thus mainly based on resequencing candidate genes [39, 40]. The development of comprehensive sets of SNP markers that can be interrogated in highly multiparallel HT SNP genotyping ushered in the era of germplasm diversity studies and GWAS in crop plants [87, 98, 119, 133–138].
For barley, a few germplasm collections including wild and landrace barley have been genotyped using custom-made oligo-pool assays (OPAs) based on Illumina GoldenGate technology [139, 140]. SNP markers significantly associated with traits are being used to identify genomic regions that harbor candidate genes for these traits in various collaborative barley projects. It is relatively easy to detect marker-trait associations in barley cultivar populations that have extensive LD (5–10 cM). Conversely, populations with low LD are expected to provide high-resolution associations (landraces, <5 cM; wild barley, <1 cM), but the number of markers needed to find significant associations is relatively high. This rapid decay of LD in populations of wild germplasm is a key generic problem with genotyping for bi-allelic SNPs. Furthermore, ascertainment bias in bi-allelic SNP discovery, caused e.g. by rare alleles and by alleles not present in the elite cultivars, complicates the situation in landraces and wild germplasm [73, 141]. Thus, rare alleles are usually excluded from analysis. Higher marker coverage is required in order to identify candidate genes more efficiently in diverse collections. In the case of barley, a high-density SNP chip has been developed that contains 7864 bi-allelic SNPs derived from NGS of a broad range of barley cultivars (R. Waugh et al., unpublished data). Such customized arrays for HT SNP genotyping can accelerate genetic gain in breeding programs. The first barley association panels have been genotyped using this resource (Figure 2). Similar SNP chips are becoming available for an increasing number of crop plants [142, 143]. Combined studies using GWA mapping, comparative analysis, linkage mapping, resequencing and functional characterization of candidate genes have already enabled the identification of candidate genes for selected traits [66, 91, 128].
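The exclusion of rare alleles mentioned above is usually implemented as a minor allele frequency (MAF) filter. The sketch below shows one way this could look; the 5% threshold, the SNP names and the 0/1 coding for inbred lines are assumptions chosen for illustration, not values taken from the studies cited.

```python
def minor_allele_frequency(genotypes):
    """MAF of one bi-allelic SNP; calls are 0/1 alleles, None marks missing data."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / len(called)
    return min(freq, 1.0 - freq)


def filter_by_maf(snp_matrix, threshold=0.05):
    """Keep SNPs (name -> allele calls) whose MAF is at least the threshold."""
    return {
        name: calls
        for name, calls in snp_matrix.items()
        if minor_allele_frequency(calls) >= threshold
    }


# Hypothetical panel: one common SNP, one rare SNP, one SNP with a missing call.
panel = {
    "snp_common": [0, 1] * 10,        # MAF 0.5
    "snp_rare": [1] + [0] * 39,       # MAF 0.025, removed by the filter
    "snp_missing": [None, 1, 0, 0],   # MAF 1/3 among the three called lines
}
kept = filter_by_maf(panel)
```

The filter protects association tests from unstable estimates, but it also discards exactly the rare variants that may be of interest in landraces and wild germplasm.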
While genotyping arrays are useful for assessing population structure and the decay of LD across large numbers of samples, low-coverage whole-genome sequencing will become the genotyping method of choice for GWAS in plant species. As in humans, GWAS in plants will become the primary approach for identifying haplotypes and genes with common alleles influencing complex traits. However, common variants identified by GWAS account for only a small fraction of trait heritability and are unlikely to explain the majority of phenotypic variation in common traits. A potential source of the missing heritability is the contribution of rare alleles, insertion–deletion polymorphisms, copy number variants and epigenetic differences, all of which can be detected by NGS technologies. However, testing the association of rare variants with phenotypes of interest is challenging. Novel, powerful association methods designed for large-scale resequencing data have to be developed [144–149].
In the future, it can be expected that mapping by sequencing will become the method of choice to discover the genes underlying quantitative trait variation in large purified germplasm collections [150–152] or epigenetic variation [84, 88, 153–155].
PGR of crop wild relatives or locally adapted crop landraces contain a rich repertoire of alleles that have been lost through the selective processes that generated today's elite cultivars. Such alleles represent an invaluable asset to cope with future challenges for sustainable agricultural development and food production [156, 157]. In the medium term, draft genome sequences will be available for all major and many neglected crop species, and resequencing of these genomes in germplasm collections will yield a wealth of information. Transforming this deluge of data into information and knowledge will increase our understanding in all fields of genetics, including evolution, ecology, domestication and breeding. Now is a crucial time to explore the potential implications of this information revolution for genebanks and to recognize opportunities and limitations in applying NGS tools and HT technologies to genebank collections [56, 158].
The availability of sequence information can make a significant contribution to the conservation of PGR. The high degree of redundancy found between different ex situ collections wastes a considerable amount of resources (see above). Across the board, two-thirds of the seed multiplication, which is the most resource-intensive step of all conservation efforts, could be made redundant if there were ways to unambiguously identify duplicates. Most attempts to identify duplicated samples have suffered from the difficulty of agreeing on a common set of markers for a given species and from manifold problems in reproducing DNA marker data between different labs. DNA sequences do not suffer from such shortcomings and therefore represent an ideal information platform to tackle the issue of redundancy. Arguably, sequencing ex situ collections just for the sake of eliminating redundancy would be too expensive an undertaking. Combining this effort with one of the issues mentioned below could provide added value.
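How sequence data could flag candidate duplicates can be sketched with a simple pairwise identity comparison. This is an illustration under strong assumptions, not an established genebank procedure: sequences are taken to be pre-aligned and of equal length, `N` marks missing bases, the accession names are invented, and the identity threshold is arbitrary. Flagged pairs would still need checking against passport and phenotypic data.

```python
from itertools import combinations


def identity(seq_a, seq_b):
    """Fraction of compared positions that are identical; 'N' is treated as missing."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "N" and b != "N"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def candidate_duplicates(accessions, min_identity=1.0):
    """Return accession pairs whose aligned sequences reach the identity threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(accessions), 2)
        if identity(accessions[x], accessions[y]) >= min_identity
    ]


# Hypothetical aligned sequences for four accessions.
accessions = {
    "acc_1": "ACGTAC",
    "acc_2": "ACGTAC",
    "acc_3": "ACGTTC",
    "acc_4": "ACNTAC",
}
duplicates = candidate_duplicates(accessions)
```

Because sequence positions mean the same thing in every lab, such comparisons avoid the marker-set and reproducibility problems described above.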
Clearly, large crop collections cannot be sequenced in one go. Against the backdrop of the evolving technology, a stepwise approach should be envisaged. Glaszmann et al. suggested the development of ‘core reference sets’ for our crops. A core reference set (CRS) is to be understood as ‘a set of genetic stocks that are representative of the genetic resources of the crop and are used by the scientific community as a reference for an integrated characterization of its biological diversity’. Every CRS will serve as a public, standardized and well-characterized resource for the scientific community. Well-characterized, multiplied and isolated CRS have to be maintained for reference purposes, comparative studies, future reanalysis and integrative genomic analysis.
For this, existing core collections must be transformed into genetic stocks, purified (homogeneous/stabilized) and taxonomically classified to facilitate practical choices for comparative association studies. Another approach is to select diverse accessions directly from genebank collections based on all available characterization and evaluation (C&E) data, pedigree, origin and collection site information. Survey genotyping to test the purity of accessions can be done with various molecular marker types such as inter-simple sequence repeats (ISSRs) or AFLPs. Mixed accessions comprising more than one genotype have to be advanced by SSD before entering systematic molecular and phenotypic characterization (Figure 3).
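Selecting diverse accessions from a larger collection can be approached in many ways. The sketch below uses a greedy max-min heuristic on marker profiles; this technique is swapped in purely for illustration and is not the procedure described in the text or in the cited work. The 0/1 marker profiles, the accession names and the choice of starting accession are all assumptions.

```python
def hamming(profile_a, profile_b):
    """Number of marker positions at which two profiles differ."""
    return sum(x != y for x, y in zip(profile_a, profile_b))


def select_core(profiles, size):
    """Greedily pick accessions that maximize the minimum distance to those chosen.

    Starts from the alphabetically first accession (an arbitrary choice) and
    breaks ties in favor of the alphabetically earlier name.
    """
    names = sorted(profiles)
    if not names:
        return []
    chosen = [names[0]]
    while len(chosen) < min(size, len(names)):
        remaining = [n for n in names if n not in chosen]
        best = max(
            remaining,
            key=lambda n: min(hamming(profiles[n], profiles[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


# Hypothetical marker profiles for four accessions.
profiles = {"a": "0000", "b": "0001", "c": "1111", "d": "1100"}
core = select_core(profiles, 3)
```

In practice, passport, pedigree and C&E data would be combined with the marker data, and the purity of each selected accession would be checked before it enters a reference set.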
The scope of a genebank may be extended to that of a DNA bank, similar to biobanks devoted to targeted medical research. The various implications of DNA banks for PGR have been discussed elsewhere. Common standards and Biobank Information Management Systems (BIMSs) have to be developed to deal with highly complex and diverse sets of metadata. Advanced technologies for high-quality biosample storage and management systems are available and have to be implemented [160, 161].
Precise phenotyping is one of the major bottlenecks in characterizing large collections. New, non-invasive, automated technologies based on novel sensing and imaging are currently under development for systematic phenotyping under greenhouse and field conditions. Phenomics is an emerging field in which large and complex data sets are being produced. These require long-term storage for comparative analysis or for future reanalysis once software tools and algorithms have improved [162, 163]. Pre-selection of contrasting accessions by different strategies, including allele mining approaches, genotyping using custom-made Bead Chips and morphological characterization, is an effective way to reduce the number of accessions prior to thorough phenotyping, the latter being the most time-consuming step.
The ultimate goal regarding the valorization of PGR will be the deployment of novel alleles that improve the trait under consideration. While resequencing of candidate genes is a straightforward approach to identify allelic variation, deployment of novel alleles in a breeding program is contingent on prior phenotypic validation. So far, this has been restricted to major genes, e.g. for disease resistance and seed quality. Validation of alleles of candidate genes for quantitative traits, e.g. by Targeting Induced Local Lesions in Genomes (TILLING), still remains a major challenge [164, 165]. In this regard, the ability to replace alleles by site-specific recombination could spur the targeted utilization of PGR and thus greatly enhance the value chain of biodiversity.
This work has been funded by Leibniz Institute of Plant Genetics and Crop Plant Research (IPK).
Benjamin Kilian is in the research group Genome Diversity at the IPK. His main interests are in genetic diversity, evolution and domestication of Triticeae. He is in charge of projects aiming at exploiting natural genetic diversity by whole-genome association mapping, high-throughput phenotyping and resequencing approaches.
Andreas Graner is managing director of the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) and the head of the German Federal ex situ genebank for agricultural and horticultural plants. His research aims at developing genomics-based approaches for the valorization of plant genetic resources of barley (Hordeum vulgare).