Starting from the MGC resource, we created protein expression ORF clones in two different formats for over 3400 human genes (HFLEX7000), making this collection the largest contribution of fully sequence-verified ORF clones to the ORFeome Collaboration (www.Orfeomecollaboration.org). The selection criteria for this subset combined the publication record of each gene with its association with biological and human disease MeSH terms, as defined by two programs, MedGene and BioGene. We aimed for this subset to reflect a distribution similar to that of MGC or the genome, rather than to create a function- or disease-specific subset.
To assure the quality of this cDNA clone collection, we fully sequence-verified all clones. By choosing the appropriately formatted clone, users can add peptide tags to either end of the expressed protein or express the protein without any additional amino acids. This choice matters in practice: for some proteins the C-terminal amino acids may be functionally important (PDZ domain [18]), requiring a translation stop at the natural position, whereas for other proteins the natural N-terminus is relevant (e.g., signal peptides for membrane protein trafficking [19], [20]). Some applications use fusion tags at the C-terminus as an experimental readout (e.g., yeast two hybrid [7]), or for capturing expressed proteins and confirming full-length expression [21].
We targeted over 3500 unique genes and obtained a fully sequence-validated ORF clone for 97% (>3400) of them. The strategy of sequencing only one clonal isolate per gene yielded an acceptable clone in 90% of cases. The success rate dropped to 80% when second isolates of the failed clones were sequenced, which raises the question of whether sequencing additional isolates is worthwhile for clones that have failed twice. In addition, capture efficiency, as measured by the number of colonies after transformation, did not predict eventual clone success; clones with high and low colony counts were equally likely to be rejected at subsequent steps.
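The diminishing return from sequencing further isolates can be illustrated with a short calculation. This is only a sketch: it uses the rounded 90% and 80% rates quoted above, and the rate assumed for a third round is hypothetical, since no third round is reported here. With the rounded rates, two rounds give a cumulative yield of about 98%, close to the 97% reported.

```python
def cumulative_yield(round_rates):
    """Fraction of genes with an accepted clone after successive rounds.

    Each rate is the success rate among the genes that were still
    without an accepted clone at the start of that round.
    """
    remaining = 1.0
    for rate in round_rates:
        remaining *= (1.0 - rate)
    return 1.0 - remaining


# Rounded rates quoted in the text: 90% first isolate, 80% second isolate.
two_rounds = cumulative_yield([0.90, 0.80])

# Hypothetical third round, ASSUMING the rate stays at 80% (not measured).
three_rounds = cumulative_yield([0.90, 0.80, 0.80])
third_round_gain = three_rounds - two_rounds

print(f"after two rounds: {two_rounds:.1%}")
print(f"gain from a third round: {third_round_gain:.1%}")
```

Even under this optimistic assumption, a third round can recover at most the small fraction of genes still outstanding after two rounds.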
One set of troublesome ORFs, identified during PCR and confirmed during sequencing, contained a second copy elsewhere in the clone of either the 5′ (near the ATG) or the 3′ (near the stop codon) sequence used to design the PCR primers. This duplication led to inappropriate PCR priming and ultimately to an inability to clone the gene. Any project using a similar strategy to convert MGC clones into ORF clones may encounter the same problem, and alternatives such as restriction enzyme/ligation-based cloning or fragmented PCR cloning should be considered for such ORFs.
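Templates with this kind of duplication could in principle be flagged before PCR by checking whether a primer sequence occurs more than once in the template clone. The following is a minimal sketch and not the procedure used in this project: the function names are hypothetical, and it considers exact matches only, whereas real mispriming can also arise from partial matches at the 3′ end of the primer.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def count_occurrences(haystack, needle):
    """Count exact occurrences of needle in haystack, including overlaps."""
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


def count_priming_sites(template, primer):
    """Count exact primer matches on either strand of the template."""
    template = template.upper()
    primer = primer.upper()
    total = count_occurrences(template, primer)
    rc = reverse_complement(primer)
    if rc != primer:  # avoid double counting palindromic primers
        total += count_occurrences(template, rc)
    return total


def has_duplicate_site(template, primer):
    """True if the primer matches the template at more than one site."""
    return count_priming_sites(template, primer) > 1


# Toy template in which the 5' primer sequence is duplicated downstream.
template = "ATGGCCAAA" + "TTTGGG" + "ATGGCCAAA" + "CCCTAA"
print(has_duplicate_site(template, "ATGGCCAAA"))  # 5' primer, duplicated
print(has_duplicate_site(template, "CCCTAA"))     # 3' end sequence, unique
```

ORFs flagged this way would be candidates for the alternative cloning approaches mentioned above.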
In summary, the clones from MGC provide an excellent resource for ORF clone production. The 97% success rate in producing fully sequence-validated clones, of which 96% match the template clone perfectly, shows that this strategy is feasible and cost-effective. Together with our other human ORF sets, notably several hundred DNA binding proteins, over 500 kinases, and 1000 breast cancer associated genes, this much broader collection of 3500 genes will be of great benefit to the research community.