In large genomics projects involving many different types of analyses of bacterial artificial chromosomes (BACs), such as fingerprinting, end sequencing (BES) and full BAC sequencing there are many opportunities for the identities of BACs to become confused. However, by comparing the results from the different analyses, inconsistencies can be identified and a set of high integrity BACs preferred for future research can be defined.
The location of each bovine BAC in the BAC fingerprint-based genome map and in the genome assembly were compared based on the reported BESs, and for a smaller number of BACs the full sequence. BACs with consistent positions in all three datasets, or if the full sequence was not available, for both the fingerprint map and BES-based alignments, were deemed to be correctly positioned. BACs with consistent BES-based and fingerprint-based locations, but with conflicting locations based on the fully sequenced BAC, appeared to have been misidentified during sequencing, and included a number of apparently swapped BACs. Inconsistencies between BES-based and fingerprint map positions identified thirty one plates from the CHORI-240 library that appear to have suffered substantial systematic problems during the end-sequencing of the BACs. No systematic problems were identified in the fingerprinting of the BACs. Analysis of BACs overlapping in the assembly identified a small overrepresentation of clones with substantial overlap in the library and a substantial enrichment of highly overlapping BACs on the same plate in the CHORI-240 library. More than half of these BACs appear to have been present as duplicates on the original BAC-library plates and thus should be avoided in subsequent projects.
Our analysis shows that ~95% of the bovine CHORI-240 library clones with both a BAC fingerprint and two BESs mapping to the genome in the expected orientations (~27% of all BACs) have consistent locations in the BAC fingerprint map and the genome assembly. We have developed a broadly applicable methodology for checking the integrity of BAC-based datasets even where only incomplete and partially assembled genomic sequence is available.