BAC libraries are a key component of many large genomics projects. They are used in the construction of maps of regions of genomes (see [1] for examples from the bovine genome), in the construction of maps of complete genomes [6], to provide a framework for the sequencing of genomes [12], and in comparative genomic hybridisation to study genome rearrangements [14]. Many projects undertake fingerprint and BAC end sequence (BES) analyses to construct physical maps of the target genome; this information can also be used to identify a tiling path of BACs to be sequenced as part of a genome sequencing strategy. To enable different groups to undertake a range of analyses, several copies of the BAC library may be created, or subsets re-arrayed, with a number of organisations undertaking various parts of the fingerprinting, BAC end sequencing and full BAC sequencing. Each additional copy and re-arraying step potentially increases the chance of BAC assignment errors.
The route taken by the bovine genome project is summarised in Fig. 1. The CHORI library, CHORI-240 [16], was one of the major libraries used for the genome sequencing project [9]. It was fingerprinted at the British Columbia Cancer Agency Genome Sciences Centre, and contigs were constructed from BACs with overlapping fingerprints [9]. These BAC contigs have been mapped to the bovine radiation hybrid [17] and composite maps [9] using markers assigned to BACs, largely by using BESs to create PCR probes. Smaller numbers of BACs from two other libraries, RPCI-42 [18] and TAMBT [19], were also included in the BAC fingerprint-based map. A smaller set of CHORI-240 BACs has been included in a second BAC fingerprint map [20]. Most plates in the CHORI-240 library, and smaller numbers from the other libraries, were BAC end-sequenced, with the work shared across a number of different laboratories [21]. Skim sequencing, typically at 1.5× coverage, of approximately 10% of bovine BACs was undertaken at Baylor College of Medicine (BCM) as part of the bovine genome sequencing project [22].
Figure 1. Flow of BAC clones and DNA samples for the bovine BAC fingerprinting, BAC end sequencing and genome sequencing projects. Results of the analyses undertaken in this publication are summarised. Numbers inside the boxes indicate the number of wells in the ...
Throughout the bovine genome project the CHORI-240 library was replicated a number of times, and several research groups applied different methods at different times on independent equipment. As part of the processing in these laboratories, clones were re-arrayed several times from 384-well to 96-well plates for growth of cells prior to preparation of DNA, and were further split onto two 96-well plates for sequencing of the two ends of each BAC clone [21]. Although such re-arraying is common, there have been relatively few studies of its impact on the integrity of BAC assignments within large genomics projects. In an early assessment of the Human Genome Project, analyses of the association between clone name and sequence for the human BESs found a match rate of only 30% for some sets of BACs [23]. In a specific test of integrity, 91% of clones contained the same BESs when determined at two different centres [23]. More recently, during the construction of a set of BAC clones spanning the human genome, approximately 7% of the clones reanalysed did not generate the same fingerprint as in the original fingerprinting [24]. In a study using mouse genome BAC libraries, a consistency rate of 95% was observed for repeat sequencing of BAC ends [25]. The authors proposed that the high levels of automation in the processing pipelines should further increase the integrity of the datasets being generated [25]. Similar analyses of EST datasets also indicate a range of tracking error rates: almost 38% in a sample of the IMAGE cDNA clones [26], 11.1% in a set of bovine cDNAs [27] and ~7.6% in a set of honey bee cDNA clones [28]. In contrast, lane tracking errors during sequencing appear to be generally low, around 0.5% in a survey of a number of EST libraries [29].
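The re-arraying of clones from 384-well to 96-well plates described in this paragraph can be made concrete with a small coordinate mapping. The sketch below is illustrative only: it assumes the common interleaved quadrant layout (quadrant 1 takes A1, quadrant 2 takes A2, quadrant 3 takes B1, quadrant 4 takes B2 of the 384-well plate), which the source does not specify, and the function names `well_384_to_96` and `well_96_to_384` are hypothetical.

```python
import string

ROWS_384 = string.ascii_uppercase[:16]  # rows A..P of a 384-well plate
ROWS_96 = string.ascii_uppercase[:8]    # rows A..H of a 96-well plate


def well_384_to_96(well):
    """Map a 384-well name (e.g. 'C5') to (quadrant 1-4, 96-well name).

    ASSUMPTION: interleaved quadrant layout; real pipelines may differ.
    """
    row = ROWS_384.index(well[0].upper())  # raises ValueError on a bad row
    col = int(well[1:]) - 1
    if not 0 <= col < 24:
        raise ValueError(f"column out of range in {well!r}")
    quadrant = (row % 2) * 2 + (col % 2) + 1
    return quadrant, f"{ROWS_96[row // 2]}{col // 2 + 1}"


def well_96_to_384(quadrant, well):
    """Inverse mapping: (quadrant, 96-well name) back to the 384-well name."""
    if quadrant not in (1, 2, 3, 4):
        raise ValueError(f"quadrant must be 1-4, got {quadrant!r}")
    row = ROWS_96.index(well[0].upper())
    col = int(well[1:]) - 1
    if not 0 <= col < 12:
        raise ValueError(f"column out of range in {well!r}")
    row_384 = row * 2 + (quadrant - 1) // 2
    col_384 = col * 2 + (quadrant - 1) % 2
    return f"{ROWS_384[row_384]}{col_384 + 1}"


# If the quadrant 1 and quadrant 2 daughter plates were swapped, the clone
# grown from 384-well position A1 would be recorded as if it came from A2.
quadrant, well_96 = well_384_to_96("A1")
mislabelled_as = well_96_to_384(2, well_96)  # "A2"
```

Under this assumed layout, a single swapped or rotated daughter plate shifts every affected clone by a fixed offset on the parent plate, which is one way that a handling error could produce a regular pattern of misassigned BACs rather than isolated ones.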
A number of genome projects have used fingerprint maps, BESs and genome sequence data to identify sets of reliable BAC clones spanning the genome [24] or to build a BAC-based map of the genome [30]. In these projects the consistency between the BAC fingerprint-based and BES-based positions on the genome was used to include BACs in, or exclude them from, the clone set or the map. In the set of 73,305 paired end-sequenced BACs positioned on the rat genome assembly, 2% were assigned to different chromosomes by their fingerprint and their BESs [30]. However, the source of the discrepancy (an incorrect BAC fingerprint or incorrect BES(s)) was not reported.
The availability of the draft Btau3.1, and more recently the Btau4.0, assemblies of the bovine genome, which include sequence data derived from skim sequencing of a large number of BACs from the CHORI-240 library, provides an opportunity to determine the integrity of the BAC fingerprinting, BAC end sequencing and full BAC sequencing. In addition, it should be possible to identify the procedure during which a problem most likely occurred. All three sets of data are incomplete: the Btau3.1 genome assembly is only a draft, not all BACs have been fingerprinted and, of those that have, not all have been included in contigs. Finally, many BACs have only one or even no BES reads available (Table 1). By comparing the positions of the same BACs in the genome assembly (using the BESs and other, non-BES, sequence data from the same BAC) and in the BAC fingerprint-based map, BACs can be divided into groups on the basis of consistency between the various combinations of the three datasets (Fig. 2). The challenge is to develop a methodology that identifies problems and makes predictions about BACs for which only incomplete information is available. The high level of automation in sample processing, and the use of equipment such as multi-channel pipettes, may help here, since we might expect problematic BACs to occur in identifiable patterns, allowing predictions to be made on the basis of those patterns. Sporadic problems will be harder to detect and impossible to predict at the level of individual BACs.
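The grouping of BACs by consistency between the three datasets can be sketched as a pairwise comparison of positions. This is a minimal illustration under stated assumptions, not the method used in this study: the `BacEvidence` record, the `agree` and `classify` functions, the category labels and the 500 kb tolerance are all hypothetical, and each dataset is reduced to a single (chromosome, coordinate) position per BAC.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[str, int]  # (chromosome, coordinate in bp)


@dataclass
class BacEvidence:
    """Positions for one BAC from each dataset; None means no data."""
    name: str
    fingerprint: Optional[Position] = None  # from the fingerprint map
    bes: Optional[Position] = None          # from BES placement on the assembly
    sequence: Optional[Position] = None     # from non-BES BAC sequence data


def agree(a, b, tolerance_bp=500_000):
    """True/False if both positions exist, None if either is missing.

    ASSUMPTION: same chromosome and within tolerance_bp counts as agreement.
    """
    if a is None or b is None:
        return None
    return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance_bp


def classify(bac, tolerance_bp=500_000):
    """Assign a BAC to a consistency group from the pairwise comparisons."""
    fp_bes = agree(bac.fingerprint, bac.bes, tolerance_bp)
    fp_seq = agree(bac.fingerprint, bac.sequence, tolerance_bp)
    bes_seq = agree(bac.bes, bac.sequence, tolerance_bp)
    known = [r for r in (fp_bes, fp_seq, bes_seq) if r is not None]
    if not known:
        return "insufficient data"
    if len(known) < 3:  # only two of the three datasets are available
        return "consistent (partial data)" if all(known) else "inconsistent (source undetermined)"
    if all(known):
        return "all consistent"
    # With all three datasets, the one that disagrees with two mutually
    # consistent datasets is the most likely source of the problem.
    if bes_seq and not fp_bes and not fp_seq:
        return "fingerprint suspect"
    if fp_seq and not fp_bes and not bes_seq:
        return "BES suspect"
    if fp_bes and not fp_seq and not bes_seq:
        return "BAC sequence suspect"
    return "inconsistent (source undetermined)"


example = BacEvidence("clone_X", fingerprint=("chr5", 2_000_000),
                      bes=("chr1", 1_100_000), sequence=("chr1", 1_050_000))
group = classify(example)  # "fingerprint suspect"
```

The sketch also shows why incomplete data limit the analysis: when only two of the three datasets are available for a BAC, a disagreement can be detected but cannot be attributed to either dataset.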
Table 1. Numbers of fingerprinted and end sequenced BACs
Figure 2. Diagrammatic representation of the multi-way analysis of the BAC fingerprint map, BAC-end sequences and BAC sequences.
Here we undertake a three-way comparison of BAC-based genomic datasets using the datasets of the International Bovine BAC Mapping and Genome Sequencing Consortia.