A WGS sequencing project is often divided into two phases: (1) assembly of the reads into scaffolds based on sequence overlap and (2) construction of chromosome pseudomolecules by placing and orienting the scaffolds using other information (i.e., genetic and physical maps). We found that the genetic map, while generally collinear with the genomic sequence, showed widely varying rates of recombination.shows chromosome 6 of soybean (formerly linkage group C2), which illustrates the phenomenon. The horizontal section in the middle of the chromosome covers the centromeric and pericentromeric regions where a large physical distance corresponds to a small genetic distance. The relative lack of recombination in this region results in poor resolution and difficulties in ordering and orienting those scaffolds. In contrast, the euchromatic regions at either end display high genetic-to-physical ratios (in the range of 1 centimorgan per 200,000 bases [1
kb]), enabling confident placement of most scaffolds. We were able to use chromosomal-scale signals in both dinucleotide signature differences and binding energies as an aid in ordering and orienting scaffolds in the soybean genome.
Figure 1 Physical (horizontal) versus genetic (vertical) distance for soybean chromosome 6. Note the flat region in the middle of the chromosome, corresponding to a portion of the chromosome with few recombination events. This hinders accurate marker-based placement (more ...)
Plots of binding energy and dinucleotide differences were overlaid with scaffold boundaries.gives an example of a dinucleotide plot along with the scaffold boundaries for chromosome 6. The darkened scaffolds show peaks that we believe correspond to the centromere, based on concentrated arrays of 91-92 base satellite repeats [5
] at those locations. Gradients shown in , as well as other supporting information (below) led to the chromosomal build shown in .
Figure 2 The plots in (a) and (b) show the dinucleotide difference of two assemblies of chromosome 6. The vertical lines correspond to scaffold boundaries. The dinucleotide signature of the darkened two scaffolds provided information about their orientation. (more ...)
The highlighted scaffolds in are scaffold_58 and scaffold_35 (from the Arachne build preceding the Glyma1.01 assembly release [1
]). Scaffold_58 contains 14 mapped markers, ranging from 102.9 to 103.9
cM on the 4.0 consensus genetic map [4
], and tentatively indicating that this scaffold should have a positive orientation. Scaffold_35 contains 7 mapped markers, ranging from 103.1 to 103.5
cM, also tentatively indicating a positive orientation. The scaffolds were initially placed with scaffold_58 first (cM 103.37), then scaffold_35 (103.41) in the orientations mentioned. These cM values are below the resolution of the map, however, so are fair game for re-evaluation. The dinucleotide plot, with peaks at the edges of both scaffolds, suggested that a reversal of orientation of both scaffolds was appropriate. This change was also supported by two FPC contigs [7
] that span the boundaries between scaffolds.
The contig WmFPC_Contig240 spans the boundary between scaffold_35 and scaffold_882, the scaffold directly to the right of scaffold_35 (see the soybase genome browser at http://soybase.org/
). WmFPC_Contig240 also spans the boundary between scaffold_882 and scaffold_3195, the scaffold directly to the right of scaffold_882. This strongly suggests that the position and orientation of scaffold_35 shown in is indeed correct. WmFPC_Contig6136 spans the boundary of scaffold_58 and scaffold_35 across the centromere. Integrity of this centromere-spanning scaffold is suspect (Will Nelson, personal communication), but together with the evidence above, the physical map provides some supporting evidence of both the correct orientation of scaffold_35 and scaffold_58.
Plots of dinucleotide binding energy along the chromosome versus genetic position were similarly useful in pseudomolecule assembly. We calculated the binding energy of 50
kb segments by adding up the energy of all the individual dinucleotides and averaging by the total count. When we plotted the averages across a whole chromosome, we observed large-scale patterns. The binding energy and variability tended to increase in the centromeric and pericentromeric regions. The average and standard deviation of binding energy from the beginning 17.5 million bases (Mb) and last 10
Mb of DNA from chromosome 6 (delineated by vertical lines in ) were 1.21 and 0.02.shows a plot of binding energy with vertical lines separating the regions. The average and standard deviation of the binding energy from the remaining middle section of the chromosome were 1.27 and 0.04. In addition to a larger variation, there tended to be large-scale oscillations present in the middle pericentromeric and centromeric regions of chromosomes. It was those large-scale patterns that we were able to exploit in assembling and orientating of scaffolds in a manner similar to our use of the dinucleotide signature.
Binding energy versus chromosomal location for soybean chromosome 6. The two vertical lines correspond to boundaries between euchromatic and heterochromatic regions, as determined from .
provides an example of the use of binding energy plots in chromosomal assembly for chromosome 2 (formerly D1b). The orientation of the darkened scaffold, scaffold_34, was provisionally reversed, on the assumption that a break in this gradient was unlikely to occur by chance precisely at the scaffold boundaries. The binding energy plot of the resulting assembly is shown in .
Figure 4 The plots in (a) and (b) show the binding energy of 50kb segments of two assemblies of chromosome 2. The vertical lines again correspond to scaffold boundaries. The darkened scaffold in (a) showed discontinuities in the connection to scaffolds (more ...)
When we further examined the marker data for that scaffold, we found suggestive evidence that the orientation shown in is correct.shows a plot of cM values versus physical location for scaffold_34 in the changed orientation. We note that the cM values of the first two markers, 68.1 and 71.7, are significantly less than the cM values of the last three markers, 79.2, 81.5, and 82.6. We also note that there is a flattening of the graph as we move in the pericentromeric region. This is what we expect as recombination events become rarer and marker resolution decreases. This provides additional evidence that the orientation of scaffold_34, shown in , is indeed correct.
Position versus cM values for markers along super33.
After observing and utilizing the small-scale signals outlined above, we decided to further characterize the DNA in order to better understand the biological meaning behind the signals. For chromosome 6, we examined the individual ρ*XY counts that were used to compute the dinucleotide differences shown in . In the centromeric region, some dinucleotides increase in relative count while others decrease, leading to the aggregate differences shown in . As an example of this, shows a plot of ρCG* and ρCC/GG* counts which illustrates the phenomenon. This analysis gives information about what dinucleotides frequencies differ along the chromosome but does not offer an underlying biological reason for those differences.
CG* and ρCC/GG* counts for 50kb stretches along chromosome 6. The counts stay relatively stable until the centromere, where they differ significantly from the rest of the genome.
Using the etandem repeat finding software, part of the EMBOSS software suite [8
], we identified many tandem repeats of dominant length 91 in that centromeric region. Those repeats have been characterized using FISH in previous studies [9
]. We also searched for a representative 91-length repeat in chromosome 6 using wublast [10
], retaining matches with at least 90% identity. Tandem arrays were then identified by counting hits that occurred within 910 base pairs (10x the repeat length) of each other. These arrays are found at exclusively one location–the centromere. ρCG
* and ρCC/GG
* counts, calculated directly from the 91 repeat, were 2.85 and 0.567, respectively, which differ significantly from the values of 0.540 and 1.21 for the entire soybean genome. Those CG and CC/GG count differences by themselves, when viewed in the context of how we determine δ
* values, are enough to explain dinucleotide differences of ~0.2. This is approximately the height of the peak in .
In attempting to explain the broad pericentromeric peaks in the binding energy plot of chromosome 2 (and all soybean chromosomes, data not shown), we analyzed the GC content of the euchromatic and heterochromatic regions as well as the GC content of a collection of LTR retrotransposons. The GC content of the LTR retrotransposons was 0.39, that of the euchromatic regions of chromosome 2 was 0.32, and that of the heterochromatic region of chromosome 2 was 0.37. When we removed the repeat content (LTRs and satellite repeats) from the heterochromatic region and recalculated, we saw a GC content of 0.33. This is enough to explain the broad peaks of and strongly suggests that it is the increased GC content of LTRs and repeats that lead to the patterns in binding energy.
We calculated similar binding energy and dinucleotide plots for chromosomes of grape, Arabidopsis, poplar, and rice to determine whether the patterns we observed in soybean were a general phenomenon or were specific to this species. Although we saw a few large-scale patterns along the chromosomes from those species, they appeared rarely and the patterns were in general more subdued than those seen in soybean (data not shown). Rice, poplar, and Arabidopsis showed relatively homogeneous peaks in the dinucleotide plots, with a noisy background consisting of narrow (~50–200
kb) peaks. The dinucleotide difference plot for the grape genome showed only minor, infrequent peaks. We are unsure whether the differences in the dinucleotide and binding energy plots were a result of differences between the genomes of those species or, rather, a difference in sequencing strategies used for the comparison genomes. The rice and poplar genomes were sequenced using clone-by-clone techniques and did not determine the sequence of all pericentromeric regions [11
]. The poplar genome was sequenced using a WGS strategy, but a larger proportion (~75
Mb) of the estimated genome was included in the pseudomolecule assemblies, and this excluded fraction was repeat dense [13