The starting point for this analysis was HES3 human ES cell RNA, from which we generated a flcDNA library (A, B and C). We then generated two libraries: (1) a GIS-PET library prepared by the standard approach, called SHE001 (D), which comprised 613 905 unique PETs that were collapsed into 25 845 transcriptional units; and (2) a GIS-PET library prepared by the Selection-MDA approach, called SHE002 (), which comprised 12 888 unique PETs that were collapsed into 3584 transcriptional units. To construct the MDA-amplified library (schematic in B), a single-PET ligation mixture was generated from the maxiprep of the flcDNA library, transformed into bacteria, and recovered for 4 h in the ‘Selection’ part of the procedure. The short 4 h growth in liquid media allows for the selection of single-insert clones, because multiple-insert clones have multiple origins of replication and cannot survive. However, the time is not long enough for the bacteria to crowd in the liquid media, so size bias is minimized. To investigate whether the bacteria had multiplied enough to crowd, we measured the optical density of the liquid media at 0, 1, 2 and 4 h. The absorbance at 600 nm (OD600) of the media increased from 0.728 at 0 h to 0.897 at 4 h. Using the approximation that 1 OD600 is ~1 × 10⁹ cells, our bacteria increased from 7.3 × 10⁸ to 9.0 × 10⁸ cells over 4 h. Hence, our bacteria are still in log growth and not yet saturated (23), and the increase in cell number should not be sufficient to cause crowding. At the end of 4 h, the bacteria were washed well and harvested. Plasmids were prepared by miniprep and DNase cleanup. A quality control check showed that clean plasmids (B) were obtained. PETs were then released by BamHI digestion (B). Released PETs were concatenated for Sanger sequencing (B). These quality controls indicate that the Selection-MDA procedures were successful in producing PETs for sequencing.
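The cell-number estimate above is simple arithmetic and can be checked with a short sketch. It uses only the OD600 readings and the 1 OD600 ≈ 1 × 10⁹ cells approximation quoted in the text; the function and variable names are illustrative, and the number of doublings is a derived quantity that is not reported in the original analysis.

```python
import math

# Approximation quoted in the text: 1 OD600 unit ~ 1e9 cells.
CELLS_PER_OD600 = 1e9


def od600_to_cells(od600, cells_per_od=CELLS_PER_OD600):
    """Convert an OD600 reading to an approximate cell number."""
    return od600 * cells_per_od


# OD600 readings reported at 0 h and 4 h.
od_start, od_end = 0.728, 0.897

cells_start = od600_to_cells(od_start)
cells_end = od600_to_cells(od_end)

# Fold increase and the equivalent number of population doublings.
fold_increase = cells_end / cells_start
doublings = math.log2(fold_increase)

print(f"{cells_start:.1e} -> {cells_end:.1e} cells")
print(f"fold increase: {fold_increase:.2f}, doublings: {doublings:.2f}")
```

Under this approximation the culture grows by roughly a quarter, which is less than one doubling over the 4 h recovery.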
We analyzed the library of PET sequences derived from the MDA approach using standard GIS-PET quality control measures (4) to investigate whether libraries prepared by the MDA approach are of good quality. Of a total of 12 888 unique PETs sequenced, 22.9% could not be mapped to the human genome. This is comparable to the percentage of unmappable PETs (26%) reported for a mouse embryonic stem cell library (4), and indicates that the MDA approach has a low percentage of chimeras due to multimers as well as high-accuracy amplification, which allows the amplified sequences to map well to the genome. In addition, the mapping accuracy (percentage within ±100 bp of the transcription start site or polyadenylation site) for all known PETs in SHE002 was 92.5% for 5′ tags and 91.9% for 3′ tags, comparable to the mouse ES cell GIS-PET library (4), which showed 90.7% for 5′ tags and 86.9% for 3′ tags. Overall, the percentage of PETs with both 5′ and 3′ tags that map accurately is 88.4% for the entire library. While high, this measure includes mRNAs that have alternative splicing and alternative transcription start sites and hence represents a lower bound. The 12 888 unique PETs were collapsed into 3584 transcriptional units. To measure the mapping accuracy of the library more accurately, we examined PET sequences from the top 20 most abundant transcriptional units, which are well annotated. The overall mapping accuracy is 98.5% for the top 20 transcriptional units of SHE002. This high level of mapping accuracy indicates that the Selection-MDA method can accurately capture gene identification signatures.
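The mapping accuracy measure defined above (a tag counts as accurate if it lies within ±100 bp of the annotated transcription start site or polyadenylation site) can be illustrated with a minimal sketch. The record layout (`tag5`, `tss`, `tag3`, `pas`) and the four toy PETs are hypothetical and are not taken from SHE002; the actual GIS-PET pipeline may handle strand and coordinates differently.

```python
def within_window(tag_pos, annotated_pos, window=100):
    """True if a tag maps within +/- window bp of the annotated site."""
    return abs(tag_pos - annotated_pos) <= window


def mapping_accuracy(pets, window=100):
    """Percentage of PETs whose 5' tag, 3' tag, or both map accurately.

    Each PET is a dict with hypothetical keys:
      tag5 / tss : mapped 5' tag position and annotated start site
      tag3 / pas : mapped 3' tag position and annotated polyadenylation site
    """
    n = len(pets)
    ok5 = [within_window(p["tag5"], p["tss"], window) for p in pets]
    ok3 = [within_window(p["tag3"], p["pas"], window) for p in pets]
    both = [a and b for a, b in zip(ok5, ok3)]
    return {
        "five_prime": 100.0 * sum(ok5) / n,
        "three_prime": 100.0 * sum(ok3) / n,
        "both": 100.0 * sum(both) / n,
    }


# Toy data (made up for illustration only).
toy_pets = [
    {"tag5": 1000, "tss": 1010, "tag3": 5000, "pas": 5050},  # both accurate
    {"tag5": 2000, "tss": 2300, "tag3": 8000, "pas": 8005},  # 5' tag off by 300 bp
    {"tag5": 300, "tss": 300, "tag3": 900, "pas": 1200},     # 3' tag off by 300 bp
    {"tag5": 40, "tss": 100, "tag3": 700, "pas": 640},       # both accurate
]

accuracy = mapping_accuracy(toy_pets)
print(accuracy)
```

Because a PET with an alternative start site or splice form fails the window test even when it is mapped correctly, the "both" figure computed this way behaves as a lower bound, as noted in the text.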
To compare the performance of the Selection-MDA protocol directly with the standard protocol, we compared the quality control measures of the MDA-prepared GIS-PET library with those of a GIS-PET library (SHE001) prepared by conventional bacterial amplification. As the data sampled from library SHE001 (613 905 PETs in total) are almost 50-fold larger than the data sampled from library SHE002 (12 888 PETs in total), a direct comparison of these two libraries would not be meaningful. Therefore, to compare the two libraries at the same number of PETs, we created three smaller virtual libraries, SHE004, SHE005 and SHE006 (), by random selection of data from the bacterial propagation library SHE001, such that the virtual libraries had approximately the same size as the MDA-prepared SHE002. Differences within the set of these three virtual libraries would reflect sampling variation. Hence, if the differences between the MDA approach and the conventional approach are significant, the differences between SHE002 and each of SHE004, SHE005 and SHE006 should be much larger than the differences among SHE004, SHE005 and SHE006. The percentages of PET matches to the genome, numbers of transcriptional units and mapping accuracies of SHE004, SHE005 and SHE006 are comparable to those of SHE002, indicating that the MDA-prepared library is of similar quality to the conventionally prepared library constructed from the same starting material ().
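The virtual-library construction can be sketched as random subsampling of a large parent library down to a target size. The text does not state whether the selection was made with or without replacement, or at the level of PETs or sequence reads; the sketch below assumes sampling without replacement from a list of PET records, and the toy parent library, the transcriptional unit labels and the sizes are all invented for illustration.

```python
import random


def draw_virtual_library(parent_pets, size, seed):
    """Randomly select `size` PET records from the parent library.

    Assumption: sampling without replacement; a fixed seed makes the
    draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(parent_pets, size)


def count_transcriptional_units(library):
    """Number of distinct transcriptional units represented in a library."""
    return len({tu for _pet_id, tu in library})


# Toy parent library: 5000 PET records spread over 400 hypothetical
# transcriptional units (a scaled-down stand-in for SHE001).
parent = [(pet_id, f"TU{pet_id % 400}") for pet_id in range(5000)]
target_size = 100  # stand-in for the size of the smaller library

virtual = {
    name: draw_virtual_library(parent, target_size, seed)
    for seed, name in enumerate(["SHE004", "SHE005", "SHE006"])
}

for name, library in virtual.items():
    print(name, len(library), count_transcriptional_units(library))
```

The spread of any summary statistic across the three draws gives a rough yardstick for sampling variation, against which the difference to the MDA-prepared library can be judged.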
Analysis of GIS-PET library quality control measures
Next, we checked whether the MDA procedure caused any biases in the sample. Because MDA is a different amplification method from bacterial amplification, we wished to investigate if there was any base bias. Base bias was measured by calculating the GC percentage of the library. There is minimal base bias between the MDA method and the conventional method ().
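As a minimal sketch of the base bias measure, the GC percentage of a library can be computed by pooling all tag sequences and counting G and C bases. Pooling across sequences and ignoring ambiguous bases such as N are assumptions made here; the text does not specify these details, and the example sequences are invented.

```python
def gc_percentage(sequences):
    """Pooled GC percentage over a collection of DNA sequences.

    Assumption: only unambiguous bases (A, C, G, T) are counted, so
    ambiguous calls such as N do not affect the result.
    """
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases to count")
    return 100.0 * gc / total


# Toy tag sequences (made up for illustration only).
toy_tags = ["GGCC", "ATAT", "GCAT"]
print(gc_percentage(toy_tags))
```

Comparing this single number between the MDA-prepared library and each virtual library is one simple way to look for base composition differences.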
Again, because MDA is a different amplification method, we investigated whether there is any bias towards any category of genes, such as novel genes. We grouped the PETs and transcriptional units into ‘known genes’, ‘gene predictions’, ‘ESTs’ and ‘novel genes’. All libraries showed similar distributions, indicating minimal category bias ().
The Selection-MDA step could not have introduced a length bias in this particular library, because Selection-MDA was performed on single-PET clones, which are all of a fixed size. Therefore, we could not test whether Selection-MDA itself would result in length biases. However, given that MDA was performed on the full-length cDNA library maxiprep to obtain more material for the construction of the single-PET library in the MDA procedure, we reasoned that this step might have introduced a length bias, and investigated this possibility. We tested for the presence of length bias by investigating the mRNA lengths of the best-matching known genes, ESTs or gene predictions, and found a length bias towards shorter mRNAs in the Selection-MDA library, but the bias is small (). Given that the bias is small, it is possible that the apparent bias is the result of sampling variation.
Figure 3. Analysis of length bias between the MDA approach and the bacterial amplification approach. We tested for the presence of length bias by classifying the mRNA lengths of the best-matching Known Genes, ESTs, or Gene Predictions from each library into 500-bp ...
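The length classification described in the figure legend can be sketched as a histogram of mRNA lengths in 500-bp bins, expressed as percentages so that libraries of different sizes can be compared. The bin boundaries (half-open bins starting at 0) and the example lengths are assumptions made for illustration.

```python
from collections import Counter


def length_histogram(mrna_lengths, bin_width=500):
    """Percentage of mRNAs falling into each length bin.

    Bins are labelled by their lower bound; bin 0 covers lengths
    0-499 bp, bin 500 covers 500-999 bp, and so on (assumed layout).
    """
    counts = Counter((length // bin_width) * bin_width for length in mrna_lengths)
    total = len(mrna_lengths)
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}


# Toy mRNA lengths in bp (made up for illustration only).
toy_lengths = [120, 480, 499, 500, 1499, 2600]
histogram = length_histogram(toy_lengths)

for start, percent in histogram.items():
    print(f"{start}-{start + 499} bp: {percent:.1f}%")
```

A shift of the MDA-prepared library towards the lower bins, relative to the virtual libraries, would correspond to the bias towards shorter mRNAs described in the text.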
Next, we reasoned that the contents of the SHE002, SHE004, SHE005 and SHE006 libraries should be similar, because the same starting full-length cDNA library was used for the preparation of the two libraries. Hence, we compared the top 20 most abundant transcriptional units of each library with each other. The average number of transcriptional units that are the same between SHE002 (the MDA-prepared library) and any randomly selected library from a bacterial propagation library is 13. The average number of transcriptional units that are the same between the bacterial propagation libraries is 14, suggesting that the agreement between the MDA method and the bacterial amplification method is similar to the agreement between randomly selected libraries chosen from the same bacterial propagation library (). This analysis thus indicates that the contents of the MDA-prepared library show a good match to those of the conventionally prepared library.
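The top 20 comparison can be sketched as a set intersection between the most abundant transcriptional units of two libraries. The abundance dictionaries below are invented, the example uses the top 3 rather than the top 20 to keep it short, and ties in abundance are broken by insertion order here, which may differ from the original analysis.

```python
from collections import Counter
from itertools import combinations


def top_units(tu_counts, n=20):
    """Set of the n most abundant transcriptional units in a library."""
    return {tu for tu, _count in Counter(tu_counts).most_common(n)}


def shared_top_units(counts_a, counts_b, n=20):
    """Number of transcriptional units present in both top-n sets."""
    return len(top_units(counts_a, n) & top_units(counts_b, n))


# Toy PET counts per transcriptional unit (made up for illustration only).
libraries = {
    "lib_a": {"TU1": 50, "TU2": 40, "TU3": 30, "TU4": 5},
    "lib_b": {"TU1": 45, "TU3": 33, "TU5": 20, "TU2": 4},
    "lib_c": {"TU2": 60, "TU1": 41, "TU3": 29, "TU5": 1},
}

# Pairwise overlap of the top 3 units between every pair of libraries.
overlaps = {
    (a, b): shared_top_units(libraries[a], libraries[b], n=3)
    for a, b in combinations(libraries, 2)
}
print(overlaps)
```

Averaging the overlaps that involve the MDA-prepared library, and separately those among the virtual libraries, gives the two figures compared in the text.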
Identities of Top 20 most abundant transcriptional units (genes)