Our assembly of the B. taurus genome contains 2,857,605,192 bp, of which 2,612,820,649 bp are placed on one of the 30 chromosomes (Table ). The remaining 245 Mbp are contained in unplaced contiguous sequences (contigs). Figure shows the amount of sequence placed in each of the 29 autosomes and chromosome X. As the figure shows, length is inversely correlated with chromosome number, with a few exceptions, including chromosomes 11, 20, and 24.
Overall assembly statistics for the UMD2 assembly of B. taurus
Chromosome (Chr) lengths (in base pairs) based on amount of sequence in the B. taurus assembly placed on each chromosome.
We evaluated our assembly (University of Maryland assembly of B. taurus, release 2 (UMD2)) for completeness and correctness in several ways, comparing it to independent mapping data, to independently sequenced mRNA data, and to the alternative draft assembly produced by the Baylor College of Medicine Human Genome Sequencing Center, BosTau4.0 (BCM4). Each of the assemblies contains both 'placed' sequence, for which the location on the chromosomes is known, and 'unplaced' sequence. As shown in Table , the UMD2 assembly is larger than BCM4, with approximately 150 Mbp (6%) more sequence placed onto chromosomes. In addition to total size, the N50 size is a useful statistic for comparing genome assemblies: it is the size N such that 50% of the genome is contained in contigs of size N or greater. For UMD2, the N50 contig size is 93,156 bp, while for BCM4 it is 81,627 bp; the UMD2 value is thus approximately 14% larger. Figure 2 shows that for all values from N1 to N98, the UMD2 contig size is larger than that of BCM4.
Comparison of the B. taurus UMD2 and BCM4 assemblies according to sequence and mapping statistics
Figure 2 Cumulative plot of the N statistic for both the UMD2 (blue) and BCM4 (red) assemblies. Each point (X, Y) in the plot shows the contig size Y such that X% of the genome is contained in contigs of length Y or larger, for a genome of size 2.5 Gbp.
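The N statistic defined above can be illustrated with a minimal Python sketch. The function name `nx_contig_size` and the toy contig lengths are assumptions made for illustration; they are not taken from the assembly pipeline. Passing a fixed `genome_size` mirrors the use of a 2.5 Gbp genome size in Figure 2.

```python
def nx_contig_size(contig_lengths, x=50, genome_size=None):
    """Return the contig length N such that x% of the genome lies in contigs >= N.

    If genome_size is omitted, the total length of the contigs is used.
    Returns None when the contigs cover less than x% of genome_size.
    """
    if genome_size is None:
        genome_size = sum(contig_lengths)
    target = genome_size * x / 100
    running_total = 0
    # Walk from the largest contig downward until the target is reached.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None


# Toy example (hypothetical lengths, total 300 bp).
toy_contigs = [100, 80, 60, 40, 20]
print(nx_contig_size(toy_contigs))        # N50
print(nx_contig_size(toy_contigs, x=90))  # N90
```

Evaluating the function for x = 1 to 98 on each assembly's contig lengths would give the kind of cumulative curve plotted in Figure 2.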
One of the most striking differences between the BCM4 and UMD2 assemblies is the assembly of the B. taurus X chromosome (BtX). UMD2 assigned 136 Mbp of sequence to the X chromosome, while the BCM4 assembly assigned only 83 Mbp. As we describe below, all sequence on BtX in our assembly is homologous to the human X chromosome (HsX).
Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if at least 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data.
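The marker-matching criterion can be sketched as follows. This is an illustrative sketch only: the `MarkerAlignment` record, its field names, and the `summarize_markers` helper are hypothetical, and the handling of markers with several alignments (any passing alignment to a different chromosome counts as discordant) is a simplifying assumption rather than the procedure used for the results above.

```python
from dataclasses import dataclass


@dataclass
class MarkerAlignment:
    marker_id: str
    chromosome: str      # chromosome of the assembly the marker aligned to
    marker_length: int   # length of the marker sequence in bp
    aligned_bases: int   # number of marker bases covered by the alignment
    identity: float      # fraction of identical bases, 0.0 to 1.0


def marker_matches(aln, min_fraction=0.90, min_identity=0.95):
    """Apply the 90% aligned / 95% identity criterion described in the text."""
    fraction = aln.aligned_bases / aln.marker_length
    return fraction >= min_fraction and aln.identity >= min_identity


def summarize_markers(alignments, cmap_chromosome):
    """Return (placed marker ids, ids placed on a chromosome that differs from Cmap)."""
    placed, discordant = set(), set()
    for aln in alignments:
        if not marker_matches(aln):
            continue
        placed.add(aln.marker_id)
        if cmap_chromosome.get(aln.marker_id) != aln.chromosome:
            discordant.add(aln.marker_id)
    return placed, discordant


# Toy data (hypothetical markers).
alignments = [
    MarkerAlignment("m1", "1", 200, 198, 0.99),  # passes, agrees with Cmap
    MarkerAlignment("m2", "5", 200, 190, 0.97),  # passes, Cmap says chromosome 7
    MarkerAlignment("m3", "2", 200, 100, 0.99),  # fails: only half the marker aligns
]
placed, discordant = summarize_markers(alignments, {"m1": "1", "m2": "7", "m3": "2"})
print(sorted(placed), sorted(discordant))
```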
One likely reason for the larger size and greater genome coverage of our assembly is the BAC-based assembly strategy employed by the Atlas assembler used to build BCM4 [5]. That strategy involved breaking the genome into BAC-sized pieces, assembling those pieces using BAC reads and whole-genome shotgun (WGS) reads, and then merging the results. This strategy fails to incorporate reads that fall outside the regions covered by BACs. We estimate that at least 2% of the UMD2 assembly is missing from BCM4 due to gaps between BACs.
We directly aligned the two assemblies against each other in order to detect any major disagreements. Ten of the 30 chromosomes contain one or more large (>500 kb) discrepancies, primarily inversions but also deletions and translocations. Figure 3 illustrates two relatively large inversions, spanning 4 and 2.5 Mbp, on chromosomes 26 and 27. In both of these cases, as in all other large discrepancies, the Cmap data support the UMD2 assembly. Alignment plots for all 30 chromosomes are provided online in Additional data file 2.
Figure 3 Examples of large-scale disagreements between UMD2 and BCM4. (a) Dot-plot alignment of the region between 15 Mbp and 25 Mbp of chromosome 26 showing a large inversion in BCM4 compared to UMD2; (b) positions of Cmap markers for the same region of chromosome ...
We compared the two assemblies for differences in the number of apparent segmental duplications, focusing on the types of duplications that might confound assembly. We collected all intra-chromosomal duplicated segments from both assemblies that were >5 kb in length and >95% identical. We found that UMD2 had significantly fewer duplications of this type, 662 versus 3,098 in BCM4. If these regions were incorrectly collapsed duplications in UMD2, then coverage by WGS reads should be higher (approximately twice the genome-wide level) and mate pairs flanking the regions would show inconsistencies [6]. However, after analyzing regions that are single-copy in UMD2 and duplicated in BCM4, we found no substantial discrepancies in either mate pairs or coverage, indicating that the regions are most likely single-copy. It is possible that BCM4 failed to merge overlapping BACs (from different haplotypes), which would give the appearance of segmental duplications; further analysis will be necessary to resolve this question.
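The reasoning behind the collapsed-duplication check can be sketched in a few lines of Python. The function name and both thresholds (`min_coverage_ratio`, `max_inconsistent_fraction`) are illustrative assumptions; the text states only that a collapsed two-copy region should show roughly twice the genome-wide coverage and inconsistent mate pairs.

```python
def looks_collapsed(region_coverage, genome_coverage,
                    n_inconsistent_mates, n_mates,
                    min_coverage_ratio=1.5, max_inconsistent_fraction=0.05):
    """Flag a region as a possible collapsed duplication.

    Thresholds are hypothetical: a collapsed two-copy repeat is expected to
    have about 2x the genome-wide read coverage, so 1.5x is used as a loose
    cutoff, and more than 5% inconsistent mate pairs is treated as suspicious.
    """
    high_coverage = region_coverage >= min_coverage_ratio * genome_coverage
    bad_fraction = n_inconsistent_mates / n_mates if n_mates else 0.0
    return high_coverage or bad_fraction > max_inconsistent_fraction


# Toy values (hypothetical): genome-wide coverage of 7x.
print(looks_collapsed(14.0, 7.0, 0, 100))  # coverage doubled
print(looks_collapsed(7.2, 7.0, 1, 100))   # normal coverage, consistent mates
```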
Another indicator of assembly completeness, and also of its potential for annotation, is the extent to which known gene sequences can be mapped onto it. We aligned 8,689 independently validated full-length cow mRNA sequences to the two assemblies, using spliced alignment mapping tools (see Materials and methods). Figure 4 and Table S1 in Additional data file 1 show the number of sequences that had more than a fraction f of their bases contained in each genome for a range of f values. When all alignments of a gene are considered, UMD2 contains at least a portion of 8,659 mRNAs, compared to 8,555 for BCM4. All but two of the genes that map to BCM4 can be found in UMD2, whereas 106 are unique to UMD2 and not found in BCM4. Together, the two assemblies contain all but 28 of the mRNA sequences, as well as paralogs of 25 of the remaining 28 genes. Larger differences between the two genomes become apparent when the aligned fraction of the gene is considered. For instance, 8,042 genes have more than 90% of their bases mapped to the UMD2 genome, compared to only 7,771 genes for BCM4. We also directly compared the distributions of gene coverage between the two assemblies, shown in Figure 4. BCM4 has relatively more genes with low coverage, while UMD2 has a greater number of genes at the highest level (95-100%) of coverage. Overall, UMD2 has a more complete representation of the genes while containing nearly every gene in BCM4, and therefore provides a more comprehensive resource for gene annotation.
Figure 4 Assembly comparison by gene mapping. (a) Number of RefSeq mRNA sequences (out of 8,689) that can be aligned to each genome assembly at varying coverage cutoffs (horizontal axis) with at least 95% sequence identity. (b) Difference in the number of mRNAs ...
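The coverage-cutoff tabulation described above amounts to counting, for each cutoff f, the mRNAs whose aligned fraction exceeds f. A minimal sketch, assuming the per-mRNA aligned fractions have already been computed by the spliced aligner (the dictionaries and identifiers below are toy data, not results):

```python
def count_above_cutoffs(coverage_by_mrna, cutoffs):
    """For each cutoff f, count mRNAs with more than a fraction f of bases aligned."""
    return {
        f: sum(1 for coverage in coverage_by_mrna.values() if coverage > f)
        for f in cutoffs
    }


# Hypothetical aligned fractions for three mRNAs in each assembly.
umd2_coverage = {"mRNA_1": 1.00, "mRNA_2": 0.93, "mRNA_3": 0.40}
bcm4_coverage = {"mRNA_1": 0.97, "mRNA_2": 0.85, "mRNA_3": 0.00}

cutoffs = [0.0, 0.5, 0.9]
umd2_counts = count_above_cutoffs(umd2_coverage, cutoffs)
bcm4_counts = count_above_cutoffs(bcm4_coverage, cutoffs)
for f in cutoffs:
    print(f, umd2_counts[f], bcm4_counts[f])
```

A cutoff of 0.0 corresponds to "at least a portion" of the mRNA being present, and a cutoff of 0.9 to the 90% comparison quoted in the text.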
Single nucleotide differences
In a base-by-base comparison, the UMD2 and BCM4 assemblies have >2.0 million single-nucleotide differences (SNDs). Some of these might be valid haplotype differences, in which the two assemblies are both correct, while others might be errors. We focused our analysis on a subset of positions where the underlying read data indicated that the position was highly likely to be homozygous, because a large majority (or all) reads agreed with one another. We also required that each SND was flanked by 50-bp exact matches in both assemblies (see Materials and methods), which reduced the set of SNDs to 389,015. We then looked for cases where no more than one read confirmed one assembly, and all other reads (at least three) confirmed the other assembly. The UMD2 assembly contains 10,636 instances of these apparent errors versus 30,750 in the BCM4 assembly. Thus, there were approximately three times more apparently erroneous SNDs in the BCM4 assembly.
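The read-support rule for calling an apparent error can be written as a small decision function. This is a sketch under stated assumptions: the function name and the per-position read counts are hypothetical inputs, and reads supporting neither assembly's base are treated as disqualifying the position.

```python
def apparent_error(reads_for_umd2, reads_for_bcm4, reads_for_neither=0):
    """Classify a single-nucleotide difference by read support.

    Returns "UMD2" or "BCM4" to name the assembly whose base looks erroneous
    (at most one read confirms it while at least three reads confirm the
    other assembly), or None when the position is not a clear-cut case.
    """
    if reads_for_neither > 0:
        return None
    if reads_for_umd2 <= 1 and reads_for_bcm4 >= 3:
        return "UMD2"
    if reads_for_bcm4 <= 1 and reads_for_umd2 >= 3:
        return "BCM4"
    return None


# Toy examples (hypothetical read counts).
print(apparent_error(5, 0))  # all reads support UMD2
print(apparent_error(1, 4))  # one read supports UMD2, four support BCM4
print(apparent_error(3, 3))  # possible heterozygous site, not called
```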
Another way to look at fine-grained accuracy is to compare the assembly to independently generated sequences. We compared both assemblies to six finished BACs from a different cow than the source of the whole-genome project. These BAC clones were not used in either the UMD2 or BCM4 assemblies. Ninety-six percent of the BAC sequence is contained in UMD2, versus 91% in BCM4. Considering only the portions of the BAC sequence that matched, the average disagreement between the BACs and UMD2 was 0.58%, whereas for BCM4 the discrepancy rate was 0.96%. Although some of these mismatches are likely due to true polymorphisms, the excess discrepancies in BCM4 are likely to represent erroneous base calls, indicating a higher error rate in BCM4.
The B. taurus Y chromosome
Because two-thirds of the data came from a female cow, and the male DNA was based on a BAC library (Materials and methods), only a very limited amount of the assembly can be assigned to the Y chromosome. (It is worth noting here that the BCM4 assembly does not assign any sequence to the Y chromosome.) We aligned all unplaced contigs to the human Y chromosome in an effort to identify B. taurus Y sequence, and we identified 71 contigs that map to Y. When contigs in the same scaffolds were included, the total increased to 94 contigs, covering 832,527 bp. These contigs include a portion of the male sex determination gene SRY. Because few of these contigs are currently ordered with respect to one another, further work will be required to construct a better picture of the Y chromosome's structure.
Comparison to the human genome
Although humans are evolutionarily closer to mice than to cows, cows and humans have sufficient DNA sequence similarity to enable us to map the human genome almost entirely onto cow. Previous efforts based on mapping data showed that human and cow have approximately 201 homologous blocks of DNA [8]. We used flexible criteria (see Materials and methods) to align all cow chromosomes to all human chromosomes, creating a new, high-resolution synteny map of human and cow. A region was considered a homologous synteny block (HSB) if the human-cow alignment extended for at least 250 Kbp and if it was not interrupted by an inversion or by an HSB on another chromosome. If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.) A modified Oxford grid, shown in Table , gives the numbers of syntenic blocks shared between all human and cow chromosomes.
Modified Oxford grid showing the number of homologous synteny blocks on each chromosome of the cow (B. taurus) and human genomes
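The HSB rules can be illustrated with a simplified Python sketch. Everything here is an assumption made for illustration: the `Block` record, the choice to apply the 250 Kbp length filter before merging, and the use of cow coordinates alone (a fuller treatment would also check that the human coordinates are collinear). Because the blocks are processed in sorted order, two blocks are merged only when they are adjacent, that is, when nothing else falls in the gap between them.

```python
from dataclasses import dataclass

MIN_HSB_LENGTH = 250_000    # minimum alignment extent for an HSB (bp)
MAX_MERGE_GAP = 3_000_000   # gaps below this size may be bridged (bp)


@dataclass
class Block:
    cow_start: int
    cow_end: int
    human_chrom: str
    strand: str  # "+" same orientation, "-" inverted


def merge_hsbs(blocks):
    """Filter aligned blocks on one cow chromosome by length, then merge neighbors."""
    hsbs = sorted(
        (b for b in blocks if b.cow_end - b.cow_start >= MIN_HSB_LENGTH),
        key=lambda b: b.cow_start,
    )
    merged = []
    for block in hsbs:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.human_chrom == block.human_chrom
                and prev.strand == block.strand
                and block.cow_start - prev.cow_end < MAX_MERGE_GAP):
            # Same human chromosome and orientation, small empty gap: extend.
            merged[-1] = Block(prev.cow_start, max(prev.cow_end, block.cow_end),
                               prev.human_chrom, prev.strand)
        else:
            merged.append(block)
    return merged


# Toy blocks (hypothetical coordinates): an inversion interrupting a syntenic region.
blocks = [
    Block(0, 1_000_000, "X", "+"),
    Block(2_000_000, 3_000_000, "X", "+"),  # merged with the first block
    Block(3_500_000, 4_500_000, "X", "-"),  # inversion: its own HSB
    Block(5_000_000, 6_000_000, "X", "+"),  # separated by the inversion
    Block(7_000_000, 7_100_000, "X", "+"),  # shorter than 250 Kbp: dropped
]
result = merge_hsbs(blocks)
print(len(result))
```

In the toy input, the inversion splits what would otherwise be one region into three HSBs, matching the note in the text.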
Our new, more detailed map largely agrees with previously identified blocks, with a number of important differences. In a few cases, our map has fewer HSBs between a pair of chromosomes, but in many more cases, we detected new synteny blocks that had been missed previously; most of these were inversions or interruptions in larger blocks. Overall, our map increases the total number of HSBs to 268. These were created from 245 evolutionary breakpoints (268 minus 23 human chromosomes) that have appeared since the divergence of human and cow. For example, BtX and HsX were previously reported to share seven HSBs [8]. Figure 5, which shows the alignment of BtX and HsX, reveals that five large blocks cover most of the two chromosomes, with one additional, much smaller block of 800 Kbp spanning the region from approximately 24.5 Mbp to 25.3 Mbp in BtX. Not visible on this scale, though, are seven additional inversions, bringing the total number of HSBs for the X chromosome to 14. We found no HSBs on BtX that mapped to any other human chromosomes besides X.
Figure 5 Alignment of B. taurus chromosome X to human chromosome X, showing regions of large-scale synteny. Most of the two chromosomes is shared in the five large blocks evident in the figure. Red: sequences are aligned in the same orientation; blue: sequences ...
We also considered how many human genes can be found in the cow genome. For this analysis, we considered only curated human genes from the National Center for Biotechnology Information (NCBI) RefSeq database. We identified 25,710 RefSeq proteins representing 18,019 distinct human genes (many with alternative isoforms), and aligned these to the cow genome. Of the 18,019 human genes, 17,253 (95.7%) mapped to cow using our criteria. This left 766 genes that failed to map. Of these, 111 are annotated as 'hypothetical' proteins and may represent inaccurate gene models in human. The remaining 655 human genes failed to map either because they are too divergent or because the cow assembly is too fragmented or contains gaps in the regions containing those genes. Using identical methods, we found that 17,107 human genes mapped onto the BCM4 assembly. Of the unmapped genes, 693 failed to map onto either assembly, 219 mapped onto UMD2 but not onto BCM4, and 73 mapped onto BCM4 but not onto UMD2.
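The gene counts in this paragraph are mutually consistent, which can be checked with a short Python calculation. All figures are taken from the text; the variable names are introduced here only for illustration.

```python
total_genes = 18_019
mapped_umd2 = 17_253
mapped_bcm4 = 17_107
hypothetical_unmapped = 111
mapped_neither = 693
umd2_only = 219
bcm4_only = 73

# Genes that failed to map to each assembly.
unmapped_umd2 = total_genes - mapped_umd2   # 766
unmapped_bcm4 = total_genes - mapped_bcm4   # 912

# Each unmapped set splits into "mapped to neither" plus "mapped only to the other".
assert unmapped_umd2 == mapped_neither + bcm4_only
assert unmapped_bcm4 == mapped_neither + umd2_only

# Genes mapped to both assemblies, computed two ways.
mapped_both = mapped_umd2 - umd2_only
assert mapped_both == mapped_bcm4 - bcm4_only

print(unmapped_umd2, unmapped_umd2 - hypothetical_unmapped)
print(round(100 * mapped_umd2 / total_genes, 1))
```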
One surprising result was our finding that the initial assembly contained two unusual contaminants, Acinetobacter baumannii and Serratia marcescens. These bacteria are not used as sequencing reagents and are not usually detected when screening for contaminants; they appear to represent environmental contamination. The bacterial contigs, totaling 43,311 bp in 14 contigs, were removed from the UMD2 assembly, but are provided on our FTP site [9