Analysis of full-length cDNA collections
We have previously described a gene annotation of chromosome 22 [17] and its characterization [18]. In this annotation, 546 genes were defined as protein-coding genes, 387 being full length and the remainder (159) being partial, mostly as a result of unconfirmed 5' ends, incomplete genomic sequence or partial gene duplication events. We subsequently identified and removed two full-length genes which we now consider to be antisense transcripts and have extended 13 genes to full length to give a total of 398 full-length protein-coding genes (see [19] for details of the chromosome 22 ORFs). In the other cases of partial annotations we have not been able to extend the annotation sufficiently to allow identification of a complete ORF suitable for cloning. Therefore, for the purposes of this paper, where the aim is to identify clones containing complete ORFs, we only consider genes annotated as full-length protein coding as targets because of the difficulty of defining success for the partial genes.
We first considered the completeness of available full-length human cDNA collections, by comparing the DNA sequences of available cDNA library clones with our targeted set of 398 ORFs. For this analysis we used cDNA sequences downloaded from the major collections in January 2004. The publicly available cDNA collections analyzed were those from the Mammalian Gene Collection (MGC) [11], the full-length long Japan collection (FLJ) [12], the German cDNA Consortium (DKFZ) [10] and the Kazusa cDNA project (KIAA) [13]. In addition, we analyzed a commercially available set of cDNAs from Invitrogen. We aligned each of our target chromosome 22 ORFs to the available cDNA sequences to assess whether clones representing the entirety or any part of each of the chromosome 22 ORFs existed in each collection (Table 1, and see Materials and methods). This analysis showed that 240 out of 398 ORFs (60%) were represented by a cDNA clone with more than 95% identity over the full length of the ORF in at least one of the collections. In addition, a further 25 ORFs were covered by cDNA clones with gapped matches. However, only 227 (57% of the total ORFs) of these clones maintain the correct reading frame at the amino acid level. Examining the matches from the individual cDNA clone collections showed that 80% of the full-length matches were provided by the MGC. This probably reflects the selection process in this program whereby initial sequencing of the ends of cDNA clones was used to select the optimal clone for complete sequencing. The KIAA collection provided full-length matches at approximately the same rate as the MGC, given the number of sequences available (1.25% chromosome 22 full-length matches out of the total MGC collection compared with 1.38% for KIAA) and notably provided the five largest clones matched that maintained the complete ORF (sizes between 4,719 base-pairs (bp) and 3,516 bp), reflecting the emphasis on long clones in the KIAA program. The FLJ and DKFZ collections gave rates of 0.28% and 0.27% respectively, presumably because a smaller proportion of full-length clones were sequenced. Analysis of the chromosome 22 genes from these collections shows that length, but not GC content, of the ORF is a significant factor in cloning success for these collections (Mann-Whitney test, p < 0.0007), that is, there is bias against longer ORF clones.
In summary, there is currently a 60% chance of obtaining a full-length cDNA clone from one of these collections, based on a sample of 1% of the human genome. The best single collection (MGC) provides 48% of the clones. This analysis of coverage, based on the subset of full-length protein-coding genes on chromosome 22, mimics the situation occurring in a positional cloning type strategy where one might want to obtain clones for a region identified by genetic mapping. However, it does not assess whether the collections are enriched or depleted for specific classes of gene by function, tissue distribution or level of expression. Chromosome 22 is particularly GC-rich compared to other human chromosomes, so the set of genes we have used for this assessment may be biased towards housekeeping genes with widespread or ubiquitous expression, which are known to be enriched in GC-rich regions of the genome. Hence, results for specific classes of genes will differ. In any case, one can expect to obtain roughly half of the clones required from one of these collections. This is testimony to the considerable effort that has gone into constructing the resources, but is also frustrating, because other sources are required to make up the substantial remainder. To investigate whether other approaches could be used to address the completeness of cDNA clone resources, we developed an alternative method which is complementary to cDNA library sequencing, and tested this approach on the same set of chromosome 22 ORFs.
Strategy for assembling a chromosome 22 ORF clone collection
Previous efforts in human to obtain cDNA clones suitable for future functional genomics studies have started by isolating the longest possible cDNA clones [10]. In Caenorhabditis elegans, an alternative strategy has been developed that is directly tailored to clone ORFs defined by gene annotations from cDNA libraries into Gateway vectors ready for functional genomics [20]. The strategy we have developed (Figure 1) uses genome annotation to define the full-length ORFs of interest. We then aim to amplify the ORF bracketed by short sequences at either end from uncloned primary cDNA (rather than cDNA libraries) using reverse transcription (RT) PCR with modifications to allow efficient and high-throughput application. The overall aim is to obtain cDNA clones containing the defined set of ORFs more efficiently than by cDNA library screening and to access ORFs not present in existing cDNA library collections. This strategy enables a single protocol to be used for all genes, and therefore does not require the import of any previously existing cDNA clones which might be from multiple laboratories and in several vector systems. In addition, it avoids potential biases associated with cloned cDNA libraries by utilizing uncloned cDNA. We chose not to format the ORF directly for a specific recombinational cloning system because this might compromise our ability to isolate some ORFs by RT-PCR. Furthermore, ORFs cloned into a generic vector will be useful for those who do not want to use a specific vector format. ORFs in clones derived and verified by this method can be readily transferred into recombinational cloning systems by PCR with appropriately designed oligonucleotides.
Figure 1 Summary of the ORF cloning method.
For the 398 targets, a nested set of two pairs of PCR oligonucleotide primers surrounding each ORF and including a short region of the 5' and 3' untranslated regions was identified. As these primers were to be used to extract a fragment containing the ORF from an extremely complex cDNA template, design was not restricted to the sequences at the start and stop of the ORF. A highly processive, proof-reading thermostable DNA polymerase was used to amplify the ORF from a pool of cDNA derived from various tissues using two rounds of PCR. In 76% of cases amplification with KOD Hot Start polymerase was successful in generating a PCR product of expected size under one set of amplification conditions (see Additional data file 2). However, where the expected-sized PCR fragment was not obtained, we were often able to obtain a fragment by subsequent repeat of the procedure with slight modifications including increasing the annealing temperature, using Pfu-turbo DNA polymerase as an alternative enzyme for one or both rounds of PCR, or using a cDNA template from a single tissue rather than the pooled cDNA. Fragments of the correct size were cloned into a T-tailed plasmid and the inserts were verified by complete sequencing using vector primers and anticipated gene-specific primers. Assembled sequence for each clone was then compared with the expected gene sequence. Clones were accepted as correct versions of the ORF if identical to the expected sequence or if they contained only base changes that were known to be single-nucleotide polymorphisms (SNPs) or resulted in silent codon changes. Clones were also accepted with an alternative splicing event that maintained the ORF. Clones were rejected (for this study) if they contained a nonsynonymous base change that could not be confirmed as a known SNP ('unconfirmed bases') or if they resulted from an alternative splice or partially processed mRNA that did not maintain the ORF.
When a clone generated from a fragment of the correct size failed validation because of the presence of unconfirmed bases, or retention of a small intron, an alternative clone was picked and sequenced until a correct version was obtained. If alternative splicing or partial processing events gave unacceptable clones, a further round of reamplification was undertaken in order to obtain a correct fragment. Finally, if clone inserts were repeatedly unacceptable as a result of mispriming events, annotation error or amplification of a related gene, a new set of nested oligonucleotide primers was designed.
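The acceptance rules described above can be written down as a small decision function. The sketch below is illustrative only: the `CloneReport` fields and the `validate` function are hypothetical names introduced here, not part of the pipeline used in this study, and the inputs are assumed to come from a prior comparison of the assembled clone sequence with the expected ORF.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


@dataclass
class CloneReport:
    """Hypothetical summary of one sequenced clone versus the expected ORF."""
    orf_maintained: bool                 # False if a splice form or retained intron breaks the ORF
    known_snps: int = 0                  # base changes confirmed as known SNPs
    silent_changes: int = 0              # base changes that do not alter the amino acid
    unconfirmed_nonsynonymous: int = 0   # amino acid changes with no SNP support


def validate(report: CloneReport) -> Verdict:
    """Apply the acceptance criteria stated in the text."""
    # Alternative splices are acceptable only if the ORF is maintained.
    if not report.orf_maintained:
        return Verdict.REJECT
    # Any nonsynonymous change that is not a known SNP is an 'unconfirmed base'.
    if report.unconfirmed_nonsynonymous > 0:
        return Verdict.REJECT
    # Identical sequence, known SNPs and silent changes are all acceptable.
    return Verdict.ACCEPT


if __name__ == "__main__":
    print(validate(CloneReport(orf_maintained=True, known_snps=2)))
    print(validate(CloneReport(orf_maintained=True, unconfirmed_nonsynonymous=1)))
```

A rejected clone would then be routed to one of the follow-up actions in the text (pick another clone, reamplify, or redesign primers) depending on the reason for rejection.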
Process error rate and SNPs
One possible concern with a strategy that involves reverse transcription and multiple rounds of PCR amplification followed by cloning of a single molecule is that the process will introduce base errors that alter the sequence of the final cloned ORF. Analysis of error rate here is complicated by the frequency of SNPs in humans and the fact that the starting cDNA template is a mix of cDNA from multiple human donors. We estimated the error rate from reverse transcription, PCR and the cloning process by sequencing 48 clones (covering 70,656 bases) containing the ORF of the NAGA gene. These were derived by our cloning protocol using cDNA from 10 lymphoblastoid cell lines as a template, as polymorphism would be easier to identify where each cDNA mix could only be one of two haplotypes. We categorized observed base changes as known SNPs if they were found to exist in dbSNP, in ESTs or in independently sequenced cDNA clones. Base changes were categorized as putative errors if no equivalent sequence could be identified. From this analysis we identified six putative base errors, giving an overall estimate of 0.085 errors per kilobase (kb), or one error per 7.8 clones assuming a mean ORF size of 1.5 kb.
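The error-rate estimate can be reproduced from the figures quoted above. The sketch below repeats that arithmetic; the final step, the chance that a single 1.5 kb clone carries no process error, is an added illustration that assumes errors are independent and Poisson-distributed, which the text does not state.

```python
import math

# Figures quoted in the text.
PUTATIVE_ERRORS = 6
BASES_SEQUENCED = 70_656   # 48 NAGA clones
CLONES_SEQUENCED = 48
MEAN_ORF_KB = 1.5          # assumed mean ORF size used in the text

bases_per_clone = BASES_SEQUENCED / CLONES_SEQUENCED
errors_per_kb = PUTATIVE_ERRORS / BASES_SEQUENCED * 1000
errors_per_clone = errors_per_kb * MEAN_ORF_KB
clones_per_error = 1 / errors_per_clone

# Assumption (not from the text): errors follow a Poisson process,
# so P(no error in one clone) = exp(-expected errors per clone).
p_error_free = math.exp(-errors_per_clone)

print(f"bases per clone:    {bases_per_clone:.0f}")
print(f"errors per kb:      {errors_per_kb:.3f}")
print(f"clones per error:   {clones_per_error:.2f}")
print(f"P(error-free clone) under the Poisson assumption: {p_error_free:.2f}")
```

The unrounded rate gives roughly one error per 7.85 clones, in line with the figure of about one per 7.8 clones obtained from the rounded rate of 0.085 errors per kb.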
Chromosome 22 ORF clone collection
Applying the strategy outlined above to the 398 chromosome 22 ORFs, we were able to clone and confirm 278 (70%) of the targeted chromosome 22 ORFs (see Additional data file 1). Sequences of the valid ORF clones are available [19], and have been submitted to the EMBL database (accession numbers CR456339 to CR456616). Of these, 253 (91%) were derived from fragments generated with KOD polymerase. The remainder were generated using either an alternative polymerase (16; 6%) or a combination of polymerases (9; 3%) (see Additional data file 2). The universal cDNA pool was used for 249 (90%) of the clones, with 29 (10%) of clones derived from lower-complexity cDNA templates from single tissues. Of the accepted clones, 239 (86%) were the predicted splice form, with the remainder being an alternative splice which maintained the ORF; 183 (66%) clones matched the genomic DNA exactly. Of the 162 deviations from the genomic sequence (from 95 clones), 144 (89%) are previously identified SNPs either in dbSNP or dbEST, and 11 (7%) were not identified as known SNPs but did not alter the amino acid (see Additional data file 3). Seven changes were insertion/deletion events (see below). Of the 144 confirmed SNPs in a total of 372,916 bases (1 SNP every 2,590 bases), 81 were synonymous and 63 were nonsynonymous codon changes. Individual clones contained between one and eight SNPs (see Additional data file 3).
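As a consistency check, the breakdown of deviations from the genomic sequence and the SNP density can be recomputed from the counts given above. This is a minimal sketch; the dictionary keys are labels chosen here for readability.

```python
# Counts quoted in the text.
ACCEPTED_CLONES = 278
EXACT_MATCH_CLONES = 183
BASES_CLONED = 372_916

deviations = {
    "known_snp": 144,            # in dbSNP or dbEST
    "silent_not_in_db": 11,      # not a known SNP, amino acid unchanged
    "insertion_deletion": 7,
}
snp_effect = {"synonymous": 81, "nonsynonymous": 63}

total_deviations = sum(deviations.values())
clones_with_deviations = ACCEPTED_CLONES - EXACT_MATCH_CLONES
bases_per_snp = BASES_CLONED / deviations["known_snp"]

for label, count in deviations.items():
    print(f"{label:20s} {count:4d}  {100 * count / total_deviations:.0f}%")
print(f"clones with deviations: {clones_with_deviations}")
print(f"one SNP every {bases_per_snp:,.0f} bases")
```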
Insertions or deletions that retained the ORF were observed in five clones. None of these significantly altered the ORF, as four cases involved three bases while one involved 12 bases. We also observed a polymorphism in MSE55 which involved the insertion or deletion of six amino acid repeat units and exists in three different alleles. We amplified and sequenced genomic DNA fragments across this region from 152 chromosomes of European ancestry and found that all three alleles are common and in Hardy-Weinberg equilibrium. In this case the clone chosen for the ORF collection was the same allele as seen in the publicly available genomic sequence.
In three cases we obtained clones with insertion/deletion polymorphisms that altered the ORF but were supported by available chromosome 22 sequence. To determine whether to accept these clones as ORF cDNAs, we examined all three in more detail. The clone obtained for gene APOL4 contains a 2-bp insertion compared to the canonical genomic sequence annotation. This results in a frameshift that substantially extends the ORF from 127 amino acids to 348 amino acids. We designed a PCR reaction to directly interrogate the insertion/deletion and sequenced 144 chromosomes of European ancestry. Both alleles are common in this population, and are in Hardy-Weinberg equilibrium, with the 348-amino acid form being the minor allele at 46.5%. For bK216E10.6 we obtained an ORF clone with a 2-bp insertion compared to the genomic annotation, which results in an ORF that contains an extra 318 amino acids. Using the same strategy we sequenced 150 chromosomes and showed that the sequence producing the shorter peptide is the minor allele with a frequency of 20%, and the alleles are again in Hardy-Weinberg equilibrium. In this case we do not have an accepted clone, as the insertion increased the ORF length beyond the primer sequence. The third gene is TXN2, which shows a 2-bp insertion compared to the genomic sequence which is also found in an EST (AA586375), but has not been studied further. An insertion/deletion polymorphism that alters the ORF has previously been observed in MICA on chromosome 6 [21]. From these examples we concluded that insertion/deletion polymorphisms in ORFs that alter amino acid sequence may be relatively common, and can result in altered proteins. Complete ORF collections for outbred organisms like humans should ultimately address this issue and obtain examples of all common forms of the ORF.
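For readers unfamiliar with the Hardy-Weinberg check mentioned above, the following sketch shows one common way to perform it: a chi-square goodness-of-fit test with one degree of freedom on genotype counts at a biallelic site. The genotype counts in the example are invented for illustration (the text reports only allele frequencies), chosen so that 72 individuals give 144 chromosomes with a minor allele frequency near 46.5%. The test used in the study itself is not specified.

```python
import math


def hardy_weinberg_test(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test (1 d.f.) for Hardy-Weinberg equilibrium at a biallelic site.

    Returns (frequency of allele A, chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip classes with zero expectation (a monomorphic site).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts: 21 + 35 + 16 = 72 individuals (144 chromosomes).
    freq_major, chi2, p_value = hardy_weinberg_test(21, 35, 16)
    print(f"minor allele frequency: {1 - freq_major:.3f}")
    print(f"chi-square: {chi2:.3f}, p = {p_value:.2f}")
```

A large p-value, as with these invented counts, means the genotype proportions are compatible with Hardy-Weinberg expectations.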
In addition, we were able to amplify a PCR fragment which could be identified as originating from the correct gene for an additional 53 ORFs, but have not yet been able to obtain an acceptable clone because of the presence of unconfirmed bases, or problems with splice forms including partially processed transcripts. In most cases, only one or two amino acids are changed, which could make these clones usable under some circumstances, perhaps after site-directed mutagenesis. It is also possible that these are rarer SNPs that are not currently present in dbSNP. This suggests that by sequencing more examples we will be able to obtain clones for these ORFs in the near future. Thus the clone collection would cover 83% (331) of the targeted ORFs.
In total, we initiated the amplification and cloning process 538 times, excluding initial pilot trials. These 538 events break down as follows. For 180 (45%) targeted ORFs an acceptable clone was generated at the first attempt. Further rounds of clone-picking, reamplification or primer redesign generated a further 99 acceptable clones, 83 clones containing an unconfirmed base alteration, 54 clones containing an alternative splice which lost the ORF, 23 clones containing a rearrangement or erroneous amplification event, 19 clones with retained intron sequences, four clones containing unresolved sequencing problems and 36 clones which were not the expected gene. For 41 genes we were unable to amplify a suitable product or failed to clone the fragment. Hence the efficiency of the process in terms of the return of acceptable clones is approximately 52% (278/540).
A significant area of concern is where we were unable to generate a PCR product at all corresponding to the targeted gene. To find explanations for this type of failure, we examined both the sequence characteristics of the targeted ORF and elements of the experimental design. First we examined the crude differences between the classes of ORFs that we could and could not amplify. Figure 2 shows a plot of the distributions of these two classes by GC content and length of ORF. Both GC content and length are significant predictors of success/failure to amplify (Mann-Whitney test, p < 0.0001), although logistic regression indicates there is no significant interaction between them. This suggests that alternative amplification protocols using different polymerases or PCR additives might result in additional ORFs being obtained. However, we have tested three additional enzymes or mixes (Pfu Ultra (Stratagene), Phusion (Finnzymes) and Expand 20 kb+ PCR (Roche)) and additives including DMSO, glycerol and betaine so far without identifying a design that solves the problem.
Figure 2 Sequence characteristics of cloned ORFs. (a) Plot of the distribution of the 398 chromosome 22 ORFs by GC content (%) and length (bases). Closed circles are the 331 ORFs that were isolated as acceptable clones (278) or as clones with the correct ORF but ...
Next, we explored whether it was possible to amplify any part of the failed target cDNAs from the universal mix. For 51 of the genes where we failed to amplify the expected fragment, we designed additional nested oligonucleotide primer pairs to amplify a short (100-274 bp) sequence across a splice junction. In 39 cases (74%) we amplified a fragment of the correct size and sequence under our standard nested PCR conditions, suggesting that template is present in the cDNA mix for these ORFs (data not shown). Therefore, in most cases it is possible to amplify part of the targeted ORF from the cDNA mix using this protocol, indicating that the level of target in the mix is not limiting in these cases. Given that we know we can amplify parts of many of the problematic genes, one variation that could improve access to larger ORFs in the future would be to amplify larger transcripts in pieces that can then be reassembled into a single clone using appropriate restriction enzyme digestion and ligation or PCR cloning methods.
We also examined whether successful amplification was biased towards genes expressed in many tissues. Su et al. [22] have generated microarray data indicating the distribution of expression for many human genes over 47 tissues. We downloaded these data [23] and were able to obtain tissue-distribution data for 206 of our 398 targeted genes. Codifying the diversity of tissues in which the genes were expressed as the proportion of positive tissues, and analyzing for the success or failure of amplification by logistic regression, indicated that the probability of amplifying a gene is not significantly affected by the diversity of its expression (data not shown).
We also examined diversity of expression by analyzing serial analysis of gene expression (SAGE) data derived from 242 NlaIII SAGE libraries downloaded from the SAGEmap resource [24]. SAGE tags could be uniquely mapped to 315 of the 398 ORFs targeted. Using the number of SAGE libraries in which a SAGE tag for an ORF was found to represent the diversity of tissues in which the gene was expressed, no significant relationship was found with the probability of amplifying a gene (Mann-Whitney test, p = 0.84). Furthermore, because the SAGE tag data also gives an indication of expression level, we examined whether the mean expression level found by SAGE (mean normalized tags per million SAGE reads) affected the probability of amplification and again found no significant relationship (Mann-Whitney test, p = 0.79). Taken together these analyses indicate that the success of our amplification strategy is not significantly influenced by either the range of tissues in which a gene is expressed or the level of expression. Clearly there will be some genes expressed at low levels, at specific times or in specific tissues that will need special treatment, but these data suggest that these cases may be few.
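The Mann-Whitney comparisons reported above ask whether a per-gene measure (for example, the number of SAGE libraries containing a tag) tends to differ between amplified and non-amplified ORFs. The sketch below is a self-contained version of the test using the normal approximation with a tie correction and no continuity correction; it is not the software used in the study, and the two input lists are invented purely to show the call.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_x = sum(rank_of[v] for v in x)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0   # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical counts of SAGE libraries per ORF (not data from this study).
    amplified = [12, 40, 7, 88, 150, 33, 61, 5]
    not_amplified = [20, 9, 75, 110, 3, 48]
    u, p = mann_whitney_u(amplified, not_amplified)
    print(f"U = {u}, p = {p:.2f}")
```

With samples as small as these invented ones, an exact test would normally be preferred; the normal approximation is reasonable for the several hundred ORFs considered in the text.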
Comparison of the chromosome 22 ORF collection with other cDNA sources
Returning to the cDNA clone collections, of the 331 targeted genes for which we can obtain either an acceptable clone (278) or a clone of the correct ORF but currently with a problem in its sequence (53), 208 genes also have clones in the cDNA clone collections we analyzed; 123 genes only have clones in the new chromosome 22 ORF set described here. In addition, for 19 genes which are represented in the cDNA clone collections we were unable to isolate a clone (Figures , ). This means that 88% (350) of the full-length protein-coding genes on chromosome 22 have cDNA clones. This also suggests that achieving 88% coverage of the readily accessible human ORFeome should be possible with an approach that combines the existing cDNA collections with directed RT-PCR as implemented in this analysis. Of course, because the actual number of human genes is still unknown and a significant number of genes have only partial annotation, there is still an indeterminate number of genes for which there is insufficient annotation to attempt the current strategy.
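The combined coverage figure follows from the three non-overlapping groups described above. The short sketch below adds them up; the variable names are chosen here for readability.

```python
TOTAL_TARGETS = 398

# Disjoint groups quoted in the text.
both = 208               # in our ORF set and in the cDNA collections
orf_set_only = 123       # only in the chromosome 22 ORF set described here
collections_only = 19    # only in the cDNA collections

in_orf_set = both + orf_set_only           # acceptable (278) plus problem clones (53)
in_collections = both + collections_only   # in-frame full-length collection matches
covered = both + orf_set_only + collections_only
not_covered = TOTAL_TARGETS - covered

print(f"in ORF set:          {in_orf_set}")
print(f"in cDNA collections: {in_collections}")
print(f"covered by either:   {covered} ({100 * covered / TOTAL_TARGETS:.0f}%)")
print(f"not yet covered:     {not_covered}")
```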
Figure 3 Schematic Venn diagram showing the relationships of the set of ORF clones isolated here compared with the full-length cDNA clones in current high-throughput clone collections (227 maintain the correct reading frame at the amino acid level from Table 1) ...
We analyzed the four classes of genes (isolated by us and in the cDNA collections (BOTH), isolated only here (SANGER), isolated only by the cDNA collections (OTHER) and not isolated (NOT)) by GC content, length and diversity of expression as defined above for microarray data and SAGE using nonparametric analysis of variance (Figure , and Additional data file 5). ORF length was significantly higher (p < 0.001) for genes not isolated (NOT) as compared to those isolated by us (SANGER) or those isolated both by us and the cDNA collections (BOTH). This suggests, as expected, that longer ORFs are harder to amplify or clone. A significant influence (p < 0.05) was also found for higher GC content in the genes that were either not isolated (NOT) or found only in the cDNA collections (OTHER) compared with the SANGER or BOTH classes, reflecting the influence of GC content on the ability to amplify a cDNA target as discussed above. The only significant difference (p < 0.05) for diversity of expression was between genes cloned only by us (SANGER) and those present in both our set and the cDNA collections (BOTH), with less diversely expressed genes slightly enriched in the SANGER class. This result was seen only in the microarray data, although the effect was also present in the SAGE data at just below significance. This suggests that the method described here may be able to access less widely expressed genes than have been sampled by existing cDNA library sequencing, although the effect is small. Finally, analysis of the mean level of expression of the genes in the four classes based on the normalized SAGE tag count showed no significant difference, indicating that level of expression is not a significant factor for this set of genes.