High Fidelity Duplications (HFDs) in the two assemblies of Bos taurus genome
One striking difference between the assemblies is the disparity in the number of large regions that are duplicated within a chromosome with high fidelity between copies. We defined a High Fidelity Duplication (HFD) as any region >5 kbp in length that occurs in two copies in the assembly, such that the copies are >99% identical to each other and reside on the same chromosome. To find the HFDs, we used the Nucmer software to map each assembly to itself and looked for non-overlapping self-matches longer than 5 kbp with at least 99% identity. Btau 4.2 has 3,111 HFDs, while UMD 3.1 has 69. More surprisingly, only 2 of these HFDs appear in both assemblies. The Btau 4.2 HFDs cover 83 Mbp of sequence, while the UMD 3.1 HFDs cover 1.3 Mbp.
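The selection criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used for the analysis: the `SelfMatch` record layout, the function names, and the assumption that coordinates are already normalised so that start <= end are all hypothetical, and the sketch starts from self-alignment records that have already been parsed from the aligner output. It keeps matches on the same chromosome whose two copies do not overlap, are longer than 5 kbp, and have at least 99% identity, and it counts each pair of copies once.

```python
from dataclasses import dataclass

MIN_LEN_BP = 5_000     # length threshold from the HFD definition
MIN_IDENTITY = 99.0    # percent identity threshold ("at least 99%")


@dataclass(frozen=True)
class SelfMatch:
    """One self-alignment record (hypothetical layout, 1-based inclusive, start <= end)."""
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int
    end_b: int
    identity: float  # percent identity between the two copies


def is_hfd(m: SelfMatch) -> bool:
    """Return True if the match meets the HFD criteria stated in the text."""
    len_a = m.end_a - m.start_a + 1
    len_b = m.end_b - m.start_b + 1
    same_chrom = m.chrom_a == m.chrom_b
    # The copies must not overlap; this also rejects a sequence matching itself.
    disjoint = m.end_a < m.start_b or m.end_b < m.start_a
    return (same_chrom and disjoint
            and min(len_a, len_b) > MIN_LEN_BP
            and m.identity >= MIN_IDENTITY)


def find_hfds(matches):
    """Keep qualifying matches, reporting each pair of copies only once."""
    seen = set()
    hfds = []
    for m in matches:
        if not is_hfd(m):
            continue
        # A self-alignment lists A->B and B->A; sort the intervals so both share a key.
        key = (m.chrom_a,) + tuple(sorted([(m.start_a, m.end_a), (m.start_b, m.end_b)]))
        if key not in seen:
            seen.add(key)
            hfds.append(m)
    return hfds
```

Requiring both copies to exceed the length threshold is a design choice of this sketch; the text only states that the self-match must be longer than 5 kbp.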
In this paper we present an analysis showing that almost all HFDs in Btau 4.2, and some in UMD 3.1, are assembly artifacts and should therefore be excluded from biological analyses.
The accompanying figure shows histograms of read coverage for all HFDs in which the two assemblies disagree about copy number; i.e., at least one of the assemblies is incorrect. We created the set B1U2, containing the regions with exactly one copy in Btau 4.2 and two copies in UMD 3.1; conversely, we created the set B2U1, containing the regions with two copies in Btau 4.2 and one copy in UMD 3.1. We show the distributions of read coverage for regions in B1U2 (dashed line) and B2U1 (solid line) as percentages of all regions in each set. (Note that B2U1 is a much larger set, with 3,111 regions versus just 69 regions in B1U2.) Based on this WGS coverage statistic, 47 of the 69 regions (68%) in B1U2 are more likely to be true segmental duplications, suggesting that the UMD 3.1 assembly is correct for these regions. In contrast, only 187 of the 3,111 regions (6%) in B2U1 appear to be true duplications, indicating that Btau 4.2 contains a large number of erroneously duplicated sequences.
Histogram of the percentage of HFDs that belong to (i) set B2U1, duplicated in Btau 4.2 and single copy in UMD Bos taurus 3.1 (solid line), and (ii) set B1U2, single copy in Btau 4.2 and duplicated in UMD Bos taurus 3.1 (dashed line).
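The coverage-based reasoning can be sketched as follows. The exact WGS coverage statistic is not specified here, so this is only one plausible reading, stated as an assumption: if a region is truly duplicated in the genome but assembled as a single copy, reads from both copies pile up on that copy and its depth approaches twice the genome-wide average. The function names and the 1.5x cutoff (the midpoint between single-copy and two-copy depth) are hypothetical and are not taken from the analysis.

```python
def likely_true_duplication(region_depth: float, genome_depth: float,
                            ratio_cutoff: float = 1.5) -> bool:
    """Call a single-copy region a likely true duplication when its read depth
    is close to twice the genome-wide average (cutoff is an assumed value)."""
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    return region_depth / genome_depth >= ratio_cutoff


def count_supported(region_depths, genome_depth, ratio_cutoff=1.5):
    """Return (number of regions called true duplications, total regions)."""
    calls = [likely_true_duplication(d, genome_depth, ratio_cutoff)
             for d in region_depths]
    return sum(calls), len(calls)


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer, as reported in the text."""
    return round(100 * part / whole)
```

Applied to the counts reported above, `percent(47, 69)` and `percent(187, 3111)` reproduce the stated 68% and 6%.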
Independent validation of false duplications in Btau 4.0
The BGSAC authors devote part of their paper to discussing the biological implications of the segmental duplications in their Btau 4.0 assembly. However, in the online supplement, they remark that many of their duplications are likely a product of mis-assembly: “A total of 1,860 pairwise alignments (>20 kbp, >94% identity) corresponding to 92.45 Mbp of apparent duplicated sequence in Btau 4.0 could not be substantiated by WSSD.” Note that these duplicated sequences were omitted from the main analysis, but they are still present in the 4.0 and 4.2 assemblies. Our analysis suggests that the problem is even more extensive: 84% of the regions that we analyzed are shorter than 20 kbp (but longer than 5 kbp; see the definition of an HFD above), and therefore they must have been included in the main analysis.
These indications of erroneous duplications in the Btau 4.0 assembly are supported by a recent independent study that examined intra-chromosomal duplication patterns in the Bos taurus genome using fluorescence in situ hybridization (FISH). The authors compared Btau 4.0 and UMD 2.0 by analyzing 13 segments of the genome that were duplicated in only one of the assemblies. The FISH results were consistent with the UMD 2.0 assembly at 10 of the 13 sites, while only 2 of the 13 were consistent with Btau 4.0.