Therapeutic proteins and antibodies represent a $125 billion annual market. Chinese Hamster Ovary (CHO) derived cell lines are the preferred host cells for the production of therapeutic proteins. Here, we present a draft genomic sequence of the CHO-K1 ancestral cell line. The assembly comprises 2.45Gb genomic sequence with 24,383 predicted genes. We associate most scaffolds to 21 microfluidically-isolated chromosomes to identify chromosomal locations of genes. Furthermore, we investigate genes involved in glycosylation, which affects therapeutic protein quality, and viral susceptibility genes, which affect cell engineering and regulatory concerns. Specifically, homologs for most human glycosylation-associated genes are identified in the CHO-K1 genome, although 141 are not expressed under exponential growth. In addition, many important viral entry genes are present in the genome but not expressed, which may explain the unusual viral resistance property of CHO cell lines. We demonstrate how the availability of this genome sequence may facilitate genome-scale science for biopharmaceutical protein production.
Recombinant therapeutic proteins first entered the market more than 20 years ago. The breadth of the product portfolio of this $125 billion annual market has been recently reviewed 1. In this market, Chinese Hamster Ovary (CHO)-derived cell lines are the preferred host expression systems because of their advantages in producing complex therapeutics and manufacturing adaptability.
CHO cells can be genetically manipulated and grown either as adherent cells or in suspension. Methods for cell transfection, gene amplification, and clone selection in CHO cells are well characterized and widely used. Furthermore, CHO cells have an established history of regulatory approval for recombinant protein expression. Most importantly, these cells perform human-compatible post-translational modifications (e.g., glycosylation), thereby improving therapeutic efficacy, protein longevity, and reducing safety concerns. Various cell-line engineering strategies have been developed for CHO cells to enhance post-translational modifications such as antibody glycosylation and protein sialylation 2. As a result, CHO cell lines now play a dominant role in bioprocessing research and the development of therapeutic biopharmaceuticals, delivering up to several grams/L of these products in highly optimized production processes3.
The genome sequences of CHO cell lines represent useful tools that have been unavailable to the bioprocessing community. Thus, efforts to apply genome-scale techniques to generate hyper-productive cell lines have been restricted to using ESTs, and the potential of the omic technologies has not been fully realized4. To address this, we present a public draft genome sequence and comprehensive annotation of the commonly used ancestral CHO-K1 cell line. We investigate the CHO-K1 genome and transcriptome for insights into protein glycosylation and viral susceptibility since these processes affect the yield and quality of therapeutic protein production.
We note that the genomes of cell lines derived from CHO-K1 over the past few decades may contain large-scale rearrangements and that even clonal populations are known to diverge into heterogeneous subpopulations5, 6. Thus, we anticipate that further analyses and sequencing studies with other clonal populations and cell lines will be required. Nevertheless, the dissemination of this ancestral CHO genome sequence should be a valuable public resource.
Paired end Illumina reads of varying insert sizes were used for the de novo assembly of CHO-K1 (Supplementary Table 1). Using the assembler SOAPdenovo 7 (Methods), 2.45 Gb of the genome was assembled with a contig N50 of 38,289bp and scaffold N50 of 1.115Mb, with fewer than 3.3% gaps (Table 1; See 8 for definition of N50). The CHO-K1 genome size was estimated to be 2.6 Gb using the k-mer estimation method (Supplementary Fig. 1). See Supplementary Figures 2–3 for distributions of sequencing depth and GC content.
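The N50 statistic quoted above can be illustrated with a short calculation. The following is a minimal sketch, assuming N50 is defined as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembled bases. The function name and the toy lengths are hypothetical and are not CHO-K1 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until their summed
    length reaches at least half of the total; the length of the last
    sequence taken is the N50.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical values for illustration only)
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

The same function applies to contigs and to scaffolds; only the input list changes.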
To assign scaffolds to chromosomes, individual chromosomes were isolated and amplified from single molecules using a microfluidic device9. Each chromosome preparation was amplified, barcoded and sequenced on an Illumina HiSeq 2000 (2×100bp reads). The reads from each chromosome preparation were aligned to the assembled scaffolds and the frequency of paired-end reads aligning from each chromosome preparation was computed and normalized. Metrics derived from the normalized frequencies were used for assigning scaffolds to a particular chromosome preparation (See Supplementary Text for additional details). All of the longest scaffolds that represent 50% of the assembly (top N50 scaffolds) had chromosome reads mapping to them. At least two thirds (68%) of the top N50 scaffolds could be unambiguously mapped to unique chromosome preparations (Table 1).
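The assignment step described above can be sketched as follows. The actual metrics are given in the Supplementary Text; this illustration assumes a simpler rule in which read-pair counts are normalized by the total count of each chromosome preparation, and a scaffold is assigned only when one preparation accounts for a dominant share of its normalized signal. The function name, the `min_share` threshold and the toy counts are all hypothetical.

```python
def assign_scaffolds(counts, min_share=0.8):
    """Assign each scaffold to at most one chromosome preparation.

    counts: {scaffold: {preparation: aligned_read_pair_count}}
    Returns {scaffold: preparation or None}; None marks scaffolds with
    no signal or without a clearly dominant preparation.
    """
    # Total read pairs per preparation, used to correct for uneven depth.
    prep_totals = {}
    for per_prep in counts.values():
        for prep, n in per_prep.items():
            prep_totals[prep] = prep_totals.get(prep, 0) + n

    assignments = {}
    for scaffold, per_prep in counts.items():
        norm = {p: n / prep_totals[p]
                for p, n in per_prep.items() if prep_totals[p] > 0}
        total = sum(norm.values())
        if total == 0:
            assignments[scaffold] = None
            continue
        best = max(norm, key=norm.get)
        share = norm[best] / total
        assignments[scaffold] = best if share >= min_share else None
    return assignments


# Toy counts (hypothetical): two preparations, three scaffolds
toy_counts = {
    "scaf1": {"prepA": 90, "prepB": 5},
    "scaf2": {"prepA": 10, "prepB": 95},
    "scaf3": {"prepA": 50, "prepB": 50},
}
print(assign_scaffolds(toy_counts))
```

Under this rule a scaffold with an even split between preparations is left unassigned, which mirrors the distinction in the text between scaffolds with mapped reads and scaffolds that map unambiguously.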
Different chromosomal counts have been reported for the CHO-K1 karyotype 10, presumably due to its genomic instability. To find evidence of multiple or duplicate chromosomes across the 22 sample preparations, we used the frequency of the paired-end reads aligning from each chromosome preparation to compute the correlation between the N50 scaffolds (See Supplementary Text). Scaffolds that are from the same chromosome will be highly correlated due to physical connection. Clustering of this correlation matrix revealed 21 large, discrete non-interacting blocks, which can be interpreted as the chromosomes containing the respective scaffolds (Fig. 1a and Supplementary Text). Consistent with this result, classical karyotyping found 21 chromosomes in CHO-K1 (Fig. 1b and Methods).
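The idea that scaffolds from the same chromosome have correlated read profiles can be illustrated with a small example. The clustering procedure actually used is described in the Supplementary Text; the sketch below substitutes a deliberately simple approach, namely Pearson correlation between per-preparation profiles followed by single-linkage grouping above a threshold. The function names, the threshold of 0.9 and the toy profiles are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(vx * vy)


def cluster_scaffolds(profiles, threshold=0.9):
    """Group scaffolds whose profiles correlate at or above threshold.

    profiles: {scaffold: [normalized frequency per preparation]}
    Uses union-find, so groups are connected components (single linkage).
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy profiles over four preparations (hypothetical values)
toy_profiles = {
    "s1": [0.90, 0.05, 0.03, 0.02],
    "s2": [0.85, 0.10, 0.03, 0.02],
    "s3": [0.02, 0.03, 0.90, 0.05],
    "s4": [0.05, 0.02, 0.88, 0.05],
}
print(cluster_scaffolds(toy_profiles))
```

Each resulting group plays the role of one block in the correlation matrix, and the number of groups gives a chromosome count under these simplifying assumptions.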
Approximately 37.79% of the CHO-K1 genome is comprised of transposable elements, as estimated from a combination of de novo repeat identification using RepeatModeller and analysis against the Repbase library11–13. This fraction of repeats is comparable to that in the mouse genome (37%) and lower than the human genome (46%). These transposable elements were classified into various categories (Supplementary Tables 2–4). The fraction of tandem repeats in the CHO genome (2.7%) is similar to rat (2.9%) and mouse (3.3%) but higher than human (1.5%). In summary, the repeat features of the CHO genome are more similar to the rodent genomes than the human genome. This observation is consistent with earlier reports in which the mouse and rat genomes were shown to have a higher fraction of repeats compared to the other mammals, especially primates 14–16.
To predict genes in the CHO-K1 genome, we used a combination of de novo gene prediction programs and homology-based methods. The predicted gene models were reconciled using the GLEAN algorithm (Methods)17. Additional evidence from the CHO-K1 transcriptome sequencing data was used to improve gene prediction (Methods and Supplementary Tables 5–6). The final gene set comprises 24,383 predicted genes, 29,291 transcripts, and 416 non-coding RNAs (Supplementary Text and Tables 8–11). Many of the predicted 24,383 genes have homologs in human (19,711), mouse (20,612) and rat (21,229) (See Supplementary Text for comparative analysis). The predicted proteins were functionally annotated using Swissprot, GO, TrEMBL, InterPro and KEGG. In all, 83% of predicted CHO-K1 proteins were functionally annotated (Supplementary Table 7). When compared to human, mouse, and rat, the distribution of CHO GO class assignments shows a significant coverage (i.e., >50% of the instances in mouse and significantly enriched, p < 0.01) of classes involved in translation, metabolism, and protein modification (Fig. 2). On the other hand, classes for which few genes were identified (i.e., < 1% of the instances in human and mouse and significantly depleted, p < 0.01) included “behavior”, “embryo development”, and “anatomical structure morphogenesis”. Taken together, the GO classes that had the least coverage in the CHO-K1 genome may be less relevant for a cell line (Fig. 2).
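The text above reports enrichment and depletion at p < 0.01 without naming the statistical test. One common choice for this kind of GO class comparison is the hypergeometric test, sketched below purely as an illustration. It is an assumption, not the method stated in the paper, and the function name and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background set
    K: background genes annotated with the GO class
    n: genes in the set under study
    k: genes in the study set annotated with the GO class
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / total


# Toy numbers (hypothetical): 5 of 5 sampled genes fall in a class
# that holds 5 of 10 background genes.
print(hypergeom_enrichment_p(k=5, n=5, K=5, N=10))
```

A depletion p-value would use the lower tail, P(X <= k), computed by summing from 0 to k instead.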
The therapeutic proteins secreted by CHO cells often include post-translational modifications including N- or O-linked glycosylation. For some of these proteins, differential glycosylation can significantly affect functional activity and/or in vivo circulatory half life18. Furthermore, such modifications can induce immune responses if they differ from native human glycans. Therefore a genome-scale assessment of CHO glycosylation is important in the understanding of CHO-derived glycoprotein quality.
Out of 300 human genes associated with glycan synthesis and degradation, the CHO-K1 genome lacks homologs for only three genes (ALG13, CHST7, and CHST13) (see Supplementary Table 13). Since almost all glycosylation genes are found in CHO-K1, we expect that the expression and activities of these gene products are more important in determining the diversity of glycan structures on protein products in CHO. As an initial assessment of this hypothesis, we obtained CHO-K1 transcriptome data for exponentially growing cells using RNA-seq.
In the CHO-K1 transcriptome, about half of the predicted glycosylation genes had detectable transcripts (Fig. 3a). N-glycan transferases, mannosyltransferases, sugar-nucleotide synthesis genes and hyaluronoglucosaminidases were enriched for expression or completely expressed. These classes are critical for constructing the core parts of the glycan chains or dictating glycan localization. The significantly depleted classes among the expressed fraction of genes included the sulfotransferases, fucosyltransferases, and GalNAc transferases.
CHO cell lines often produce glycoforms similar to human glycans. However, CHO cells do not produce the bisecting (β4) N-acetylglucosamine (GlcNAc) branch, which is found on about 10% of human IgG glycoforms 19. The CHO LEC10 cell line remedies this with a gain-of-function mutation that induces MGAT3 expression, coding for GnTIII/GlcNAcTIII, which adds the bisecting GlcNAc residue 20. The fact that the LEC10 cell line gains this functionality suggests that the gene is present in the parent strain. Consistent with this, a homolog to this gene is found in the CHO-K1 genome, but is not expressed (Fig. 3b.i).
Most mammals have five primary types of fucosyltransferases, classified by the linkages between fucose and their substrates: α(1,2), α(1,3), α(1,4), α(1,6), and protein O-fucosyltransferases (see Supplementary Table 14 for the glycans fucosylated by each class). However, in the CHO-K1 transcriptome data, only fucosyltransferase 8 (FUT8) and the protein O-fucosyltransferases (POFUT1 and POFUT2) show expression. These add α(1,6)-linked fucose to N-linked glycans (see reaction F6Tg in Fig 3b.ii) or directly to serine/threonine residues, respectively. Indeed, suppression of FUT8 activity improves the quality of CHO-produced therapeutic antibodies, by removing fucose from the Fc oligosaccharides and altering its binding properties 21–23. Furthermore, since the α(1,2), α(1,3) and α(1,4)-linked fucosyltransferases are not expressed, the Lewis and ABO blood group glycans will probably not be generated in this CHO-K1 cell-line.
Glycan sialylation can impact the function, longevity, and immunogenic effects of proteins. Sialic acids often are the terminal sugar on N-linked glycans. These sugars may increase the lifespan of glycoproteins in the circulatory system by covering the penultimate galactose, which otherwise would bind to the hepatocyte asialoglycoprotein receptor and subsequently be degraded24. The CHO-K1 genome has homologs to all six human ST3Gal enzymes, which form α(2,3) linkages of sialic acid to galactose. Moreover, these genes are expressed as well (Fig. 3b.iii). Although homologs also exist for the human ST6Gal genes, which catalyze α(2,6) linkages of sialic acid to galactose, the transcriptome data show no evidence for ST6Gal gene expression (Fig. 3b.iii). This is consistent with the observation that CHO cells do not normally show ST6Gal activity 19, whereas terminal α(2,3)-linked sialic acid residues are abundant.
One challenge in therapeutic protein production is the avoidance of immunogenic responses 25, 26 that can arise from foreign glycan structures. For example, immunogenic responses can be induced by glycans harboring N-glycolylneuraminic acid (Neu5Gc), the hydroxylated derivative of the sialic acid N-acetylneuraminic acid (Neu5Ac). This hydroxylation is catalyzed by cytidine monophosphate-N-acetylneuraminic acid hydroxylase (CMAH), which is highly expressed and active in most mammals but not in humans27. Thus, the glycosylated proteins produced in non-human cell lines can induce an immune response in humans unless Neu5Gc production is controlled. Interestingly, although a CMAH homolog is found in the CHO-K1 genome, we did not detect any expression (Fig. 3b.iv). This result is consistent with the observation that CHO cell lines contain significantly lower levels of Neu5Gc sialylation in comparison to murine cell lines 28.
The antigen Gal-α(1,3)Gal can also elicit immunogenic responses in humans, since most individuals have anti-α-Gal antibodies29. The gene responsible for producing this epitope, glycoprotein α(1,3) galactosyltransferase (Ggta1), is not expressed in human, but is active in mouse. Thus, recombinant IgAs produced in murine cell lines are significantly different from human IgAs. CHO cells lack sufficient enzymatic machinery to produce glycan structures with the α-Gal epitopes30, except in very small subpopulations31. Furthermore, IgAs produced in CHO cells are similar to human IgA and lack the α-Gal epitope 32. Consistent with these findings, a homolog to the mouse Ggta1 gene is present in the CHO-K1 genome but was not expressed. See the supplementary text for additional discussion on glycans with potential relevance to immunogenic responses.
Despite harboring homologs to human sulfotransferases in the genome, CHO-K1 does not express most of them (Fig 3a). These enzymes play important roles in the generation of heparan sulfate, which is known to be important for entry of viruses such as HIV 33, adenoviruses 34 and herpes simplex virus (HSV) 35. Interestingly, CHO-K1 has been used extensively to investigate the need for heparan sulfate in viral entry. While CHO-K1 has heparan sulfate and chondroitin-4-sulfate, several mutants with reduced or no heparan sulfate have been produced by merely inhibiting a few enzymes 36.
In the CHO-K1 genome, homologs to most human heparan sulfate glucosamine O-sulfotransferases are identified. Consistent with previous studies 37–40, we found that heparan sulfate glucosamine 2-O-sulfotransferases and heparan sulfate glucosamine 6-O-sulfotransferases are expressed. However, no detectable expression was measured for heparan sulfate glucosamine 3-O-sulfotransferases (HS3ST), which make 3-O-sulfated heparan sulfate (important for HSV-1 entry 35; Fig. 4b). Although CHO-K1 is resistant to HSV-1 infection35, the addition of mouse HS3ST genes to CHO-K1 cells renders them susceptible to HSV-1 infection 41. This result suggests that CHO-K1 lacks HS3ST activity, which is consistent with the lack of detectable HS3ST expression in our study.
Viral infections can contaminate cell culture processes, thus affecting the quality and yield of recombinant protein production. Hence, the property of resistance to viral infection demonstrated by CHO cells further contributes to their preferred choice as hosts for therapeutic protein production42. Here, we investigate this property using the CHO-K1 genome and transcriptome. Twelve independent studies were summarized to compile a list of human genes important for viral infection 43. A total of 388 human genes that were identified in two or more of these independent studies were used for subsequent analysis. Among these, CHO-K1 homologs were not found for 4 genes (IL1A, SNRPC, MT1X, and CD58). Moreover, 158 genes lacked detectable expression levels in the CHO-K1 transcriptome. Among the unexpressed genes, the most significantly enriched GO-terms in the molecular function and biological process classes were glycoprotein binding, T-cell activation, and macromolecular assembly (Supplementary Tables 15–17). Many of these genes are either cell adhesion molecules (CAMs) important for viral entry and vesicular trafficking or plasma membrane proteins involved in viral recognition. Furthermore, several histone proteins involved in nucleosome assembly do not show any detectable levels of expression in the CHO-K1 transcriptome (Fig 4a).
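The gene selection described above (genes reported in two or more independent studies, then checked for CHO-K1 homologs and expression) can be sketched as follows. This is an illustrative outline only; the function names, the placeholder gene identifiers and the toy study sets are hypothetical and do not reproduce the data of reference 43.

```python
from collections import Counter


def recurrent_genes(study_gene_sets, min_studies=2):
    """Return genes reported by at least min_studies independent studies."""
    counts = Counter()
    for genes in study_gene_sets:
        counts.update(set(genes))  # count each gene once per study
    return sorted(g for g, n in counts.items() if n >= min_studies)


def classify(genes, homolog_of, expressed):
    """Split genes by homolog presence and homolog expression.

    homolog_of: {human_gene: cho_gene} for genes with a detected homolog
    expressed:  set of CHO genes with detectable transcripts
    """
    result = {"no_homolog": [], "not_expressed": [], "expressed": []}
    for gene in genes:
        homolog = homolog_of.get(gene)
        if homolog is None:
            result["no_homolog"].append(gene)
        elif homolog in expressed:
            result["expressed"].append(gene)
        else:
            result["not_expressed"].append(gene)
    return result


# Toy input (hypothetical placeholder identifiers)
studies = [
    {"GENE_A", "GENE_B"},
    {"GENE_B", "GENE_C"},
    {"GENE_A", "GENE_C", "GENE_D"},
]
selected = recurrent_genes(studies)
print(selected)
print(classify(selected,
               homolog_of={"GENE_A": "cho_a", "GENE_B": "cho_b"},
               expressed={"cho_a"}))
```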
The Herpes Simplex Virus (HSV) is a well-studied virus that is unable to infect CHO cells due to the lack of entry receptors 44. The CHO-K1 genome and transcriptome provide insights pertaining to these entry receptors and HSV infection (Fig. 4b). HSV-1 is known to require the Nectin-1/HveC receptor (PVRL1) and herpes virus entry mediator (HveM) for entry into host cells. Although the CHO-K1 genome has homologs to both genes, expression was not detected. Integrins also are cellular receptors that regulate the cell-surface attachment and entry of viruses like HSV. Several integrin genes (e.g., ITGB3, ITGAV, and ITGAM) do not show evidence of expression in the transcriptome data. This lack of expression of integrin genes in CHO cells has been documented previously45, 46. The epidermal growth factor receptor (EGFR) also plays a role in the entry of HSV-1 into CHO-K1 cells. Reports indicate that CHO cells expressing EGFR are susceptible to HSV infection, whereas the wild type cells lacking EGFR expression are resistant47. Consistent with this observation, an EGFR homolog is in the CHO-K1 genome, but it is not expressed in the CHO-K1 transcriptome.
In addition to HSV, infection of CHO cells by other viruses such as pseudorabies virus is blocked at the level of viral penetration48. Receptors for other viruses like HIV and Hepatitis B virus (HBV) are either missing in the CHO-K1 genome or lacking expression in the transcriptome. For instance, the CD4 glycoprotein is not expressed in CHO-K1, thereby blocking entry of HIV-1 into host cells. Similarly, we do not find evidence for the CD58 gene in the CHO-K1 genome. The expression levels of the cell adhesion molecule CD58 correlate with HBV infection severity 49. Several other CAMs like CD48 and CD2 are also not expressed in the CHO-K1 transcriptome data. These proteins bind heparan sulfate and play an important role in viral infection50.
The resistance of CHO cells to viral infection is not limited to the regulation of viral entry. For instance, the restriction of Vaccinia virus replication in CHO cells is reported to occur due to the lack of the cowpox host range factor CP77, by causing a rapid shutdown of viral protein synthesis machinery 51. Consistent with this, the CHO-K1 genome does not encode this gene.
Chinese Hamster Ovary derived immortalized cell lines are the preferred host system for therapeutic protein production. CHO cell line engineering work has made incredible progress in optimizing products and titers by focusing on manipulating single genes 2 and selecting clones with desirable traits following various treatments (e.g., mutagenesis or media adjustment). This progress has been accomplished without the availability of genomic sequences. Here, we present a publicly available annotated genome sequence for a CHO cell line, which represents yet another tool in the bioprocessing toolbox. It is not anticipated that this draft sequence will directly improve product titers to the extent achieved through careful screens in the past. However, the CHO-K1 genomic sequence will facilitate the design of targeted genetic manipulations to aid in cell-line engineering (Fig. 5a), help in the elucidation of components underlying poorly characterized phenotypes (Fig. 5b), and allow for more comprehensive deployment of “omic” tools for CHO-K1 and related cell lines (Fig. 5c).
A genome-scale analysis of the glycosylation genes in the CHO-K1 genome identifies homologs to 99% of the human glycosylation-associated transcripts, with 53% of them expressed. The high coverage of homologs provides a unique opportunity for glycoform manipulation in CHO cells. Indeed, the high variability of gene silencing has led to the generation of the diverse selection of Lec mutant cell lines20. Moreover, it has been shown that clonal selection can lead to a sub-population of CHO cells expressing genes like GGTA1, that were thought to be inactive31. This result suggests that many other unexpressed glycosylation genes in the CHO genome can be potentially activated or silenced to alter the repertoire of glycan structures from CHO cells (Fig. 5a). In addition, the genome sequence will facilitate the development of genome-scale metabolic models for CHO cells. Such models allow for the assessment of the network-level effects of cell line treatments, and have been successful at predicting optimal designs for bioprocess optimization in prokaryotes 52–54.
The genome of CHO cells can also provide insight into less-well characterized phenotypes. For example, the global analysis of viral susceptibility genes in the CHO genome demonstrates that key plasma membrane receptor genes, CAMs, and genes involved in T-cell activation and macromolecular assembly are not expressed in CHO-K1. Furthermore, the lack of expression of several key viral entry receptors for HSV-1, HIV, HBV, and pseudorabies virus opens up the possibility for an in-depth analysis of CHO cell resistance to viral infection. In addition, we found several key regulatory molecules such as histone factors to be lacking expression in CHO-K1. This analysis demonstrates that the genome sequence can be integrated with omic data analysis to generate hypotheses to guide further study into poorly characterized phenotypes of CHO cells (Fig. 5b).
The CHO-K1 genome should facilitate the interpretation of various omic data types. However, it is important to note that CHO-K1 is an ancestral cell line from which many CHO cell lines have been derived. During the course of the rather stringent manipulations involved in optimizing cell lines (e.g., selection for growth in different media compositions and switching cells from adherent cell culture to suspension-adapted growth), many genomic changes have likely occurred due to the inherent genomic instability of these cell lines (e.g., SNPs, indels and other structural variations). Moreover, the cell lines derived from CHO-K1 that are widely used in the industry (e.g. DUKX-B11 and DG44) may contain additional genetic changes from chemical and radiation mutagenesis 5, 6. Thus, this genome sequence of the ancestral K1 cell line should not be considered as directly representative of all CHO cell lines. However, the full coverage draft genomic sequence of the ancestral K1 cell line will serve as a foundation to support efforts in sequencing other CHO cell lines (Fig. 5c). These additional genomic sequences will provide a context for transcriptomic and proteomic data interpretation in the respective cell lines. It will also facilitate the identification or design of other potential targets or tools for cell line engineering (e.g., miRNAs, siRNAs, etc.).
The availability of the CHO-K1 genomic sequence provides a valuable resource for genome-scale CHO-cell research and will aid in manufacturing applications. However, we expect the quality of the genomic sequence will be iteratively improved over time as more genomic information becomes available for CHO-K1 and other CHO cell lines. Moreover, we anticipate that characterizing effects of sequence variations on gene products and expression would improve the functional annotation of these cell lines. These improvements may enhance the application of CHO-cell engineering and other techniques to improve protein production and quality.
The DNA of the CHO-K1 cell line was obtained from ATCC Catalog No. CCL-61.
Genomic libraries were prepared following the manufacturer’s standard instructions and sequenced on Illumina’s HiSeq 2000 platform.
We constructed CHO-K1 genome sequencing libraries with insert sizes of 200 bp, 350 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb and 20 kb to generate a total sequence of 343.64 Gb (Supplementary Table 1). We first assembled the reads with short insert size (< 500 bp) using the de Bruijn graph based assembler SOAPdenovo (http://soap.genomics.org.cn) to obtain long contigs. In order to construct scaffolds, we realigned all the usable reads onto the contig sequences and obtained 80% of all the aligned paired-end reads. We then calculated the amount of shared paired-end relationships between each pair of contigs, weighted the rate of consistent and conflicting paired-ends, and then constructed the scaffolds step by step, in the increasing order of insert size. However, these scaffolds contained internal gaps, mainly due to repeats that were masked before the scaffold construction phase. In order to resolve these gaps, we used the paired-end information to retrieve the read pairs that had one end mapped to the unique contig and the other located in the gap region and then performed a local assembly for these collected reads. See Table 1 for statistics on genome assembly.
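The de Bruijn graph approach mentioned above can be illustrated with a toy example. The sketch below is not SOAPdenovo and omits error correction, reverse complements, bubble removal and in-degree checks; it only shows how overlapping k-mers from reads form a graph whose unambiguous paths spell out contigs. The function names, the value of k and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles, which typically arise from repeats
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Toy reads (hypothetical) that overlap to cover one short sequence
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
toy_graph = build_de_bruijn(toy_reads, k=4)
print(extend_contig(toy_graph, "ATG"))
```

In this simplified picture, a branch or cycle ends the walk. This is one way to see why repeats interrupt contigs and why paired-end information from larger insert sizes is then needed to order contigs into scaffolds.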
CHO-K1 cells were grown in F12 medium for 5 days after recovery from the stock. A 10 µg/ml colchicine solution was added to 50–75% confluent cells in one 6 cm dish to obtain a final concentration of 0.05 µg/ml colchicine. After culturing for 12 hours in an incubator, the cells were rinsed with PBS and trypsinized for 5 min. Care was taken to ensure that the cells were in a single-cell suspension. The cells were spun through the media for 2 minutes at 2000 rpm, re-suspended in 1 mL PBS, spun for 2 minutes at 2000 rpm and then re-suspended in 1 mL 0.56% KCl. The cells were incubated at room temperature for 15 minutes and spun for 2 minutes at 2000 rpm. After removal of KCl, the cells were gently re-suspended in 1 mL cold MeOH:acetic acid solution (3:1) and kept on ice for 10 minutes. The solution was then spun at 3000 rpm for 2 minutes, the supernatant was removed, and the cells were re-suspended in 200 µL fresh, cold MeOH:acetic acid solution (3:1). After gentle vortexing, 10 µL of suspended cells were added onto a clean slide held at a 60° angle in the steam bath to let the MeOH evaporate. The cells were then stained with Giemsa stain (Invitrogen/Gibco 10092-013) for two hours. The slide was then rinsed with distilled water and mounted in 50% glycerol/50% PBS. The pictures of the chromosomes were taken using a 50X microscope.
We identified known transposable elements using RepeatMasker against the Repbase transposable element library. We also aligned the genome sequence to the curated transposable element related proteins using RepeatProteinMask to identify highly diverged transposable elements. In addition, we also used RepeatModeller to construct a de novo repeat library for the CHO-K1 cell line11–13.
We performed de novo gene prediction using Genscan, Augustus and GlimmerHMM with model parameters trained on human and predicted 25,542, 43,042 and 24,021 genes respectively. We aligned the gene sets from human, mouse and rat (Ensembl release 58) and predicted 33,635, 29,767 and 41,836 genes respectively. We integrated these predictions into a combined gene set using the GLEAN pipeline to obtain a reconciled gene set containing 19,371 genes. In order to augment this gene set, we used CHO-K1 transcriptome data to annotate gene structures with the aid of the programs TopHat and Cufflinks. This resulted in a final gene set comprising 24,383 predicted genes and 29,291 transcripts.
We extracted total RNA using the TRIzol® Reagent (#15596-026), from exponentially growing cells cultured in F-12K Medium (Invitrogen) supplemented with 10% fetal bovine serum (FBS) at 37°C with an atmosphere of 5% CO2. The samples were treated with DNase in the presence of RNase inhibitor prior to cDNA synthesis. cDNA was sequenced using the Illumina GA2 technology with the paired end reads module.
The raw sequence data was filtered by removing reads which had adaptors, or reads that consisted of greater than 10% N’s or reads in which the majority base quality was less than 5. The filtered reads were mapped to the assembled scaffolds using the alignment tool TopHat, allowing a maximum mismatch of 1bp to identify the splice junctions. The unmapped reads were used in a seed-and-extend strategy by TopHat to identify reads spanning across the splice junction. This alignment was then assembled into transcripts using the software Cufflinks. Default values were used for all parameters except for the max intron length option (value used 150000). Transcripts with coverage less than 1X and length less than 200 bp were filtered out. The best potential coding region from each of the filtered transcripts was predicted using the software BestORF with parameters trained on mouse ESTs. Finally, the program cuffcompare (part of the Cufflinks suite) was used to compare and reconcile the protein sequences predicted from Cufflinks and BestORF and the Glean annotation.
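The read filtering criteria described above can be expressed as a small predicate. This is a minimal sketch rather than the pipeline actually used. It assumes that "majority base quality less than 5" means that more than half of the bases in a read have a quality score below 5, and that adaptor contamination is detected by exact substring match. The function name and the placeholder adaptor sequence are hypothetical.

```python
def keep_read(seq, quals, adaptors=(), max_n_frac=0.10, min_qual=5):
    """Return True if a read passes the three filters.

    seq:      read sequence as a string
    quals:    per-base quality scores as integers, same length as seq
    adaptors: adaptor sequences to screen for (exact match; an assumption)
    """
    if not seq:
        return False
    # 1. Discard reads containing an adaptor sequence.
    if any(adaptor in seq for adaptor in adaptors):
        return False
    # 2. Discard reads with more than 10% ambiguous bases (N).
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # 3. Discard reads in which most bases have low quality.
    low_quality = sum(1 for q in quals if q < min_qual)
    if low_quality > len(quals) / 2:
        return False
    return True


# Toy reads (hypothetical); "GGGGCCCC" is a placeholder adaptor
print(keep_read("ACGTACGTAA", [30] * 10))
print(keep_read("ACGTNNCGTA", [30] * 10))
print(keep_read("TTTTGGGGCCCCAAAA", [30] * 16, adaptors=("GGGGCCCC",)))
```

Reads that pass this predicate correspond to the filtered reads that are then passed to the aligner.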
A set of 300 glycosylation-associated human transcripts was compiled and curated from the glyco-gene chip array version 4 annotation (Functional Genomics Gateway http://www.functionalglycomics.org/static/consortium/resources/resourcecoree.shtml). We obtained the protein sequences for the human genes of interest from RefSeq Build 37.1 and Ensembl Release 58 and performed a BLAST alignment (blastP) against the protein sequences predicted in the CHO-K1 genome. We used an E-value cutoff of 1×10−5 to obtain the homologs for the genes.
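The homolog search described above amounts to filtering BLAST hits by E-value. The following sketch assumes the alignments were exported in the standard 12-column BLAST tabular format, in which the E-value and bit score are the last two columns. It also assumes that the best-scoring hit per query is retained, which the text does not specify. The function name and the toy hit lines are hypothetical.

```python
def homolog_hits(tabular_lines, max_evalue=1e-5):
    """Return the best hit per query among hits passing the E-value cutoff.

    tabular_lines: iterable of tab-separated BLAST tabular lines with
    columns qseqid, sseqid, pident, length, mismatch, gapopen, qstart,
    qend, sstart, send, evalue, bitscore.
    Returns {query: (subject, evalue, bitscore)}.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy hits (hypothetical identifiers and scores)
toy_hits = [
    "humanA\tcho1\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "humanA\tcho2\t40.0\t120\t72\t3\t10\t130\t5\t125\t2e-08\t55.1",
    "humanB\tcho3\t30.0\t60\t42\t1\t1\t60\t1\t60\t0.5\t28.0",
]
hits = homolog_hits(toy_hits)
print(hits)
print(sorted({"humanA", "humanB"} - set(hits)))  # queries without a homolog
```

Queries absent from the returned dictionary are those for which no homolog passes the cutoff.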
Identification of ncRNAs: The entire fRNAdb was downloaded (http://www.ncrna.org/frnadb/catalog_taxonomy/download) and used as a reference for local blastn with the pooled sample of transcripts. To facilitate cross species exploration, relaxed parameters were used for both seeding and alignment and an E-value cutoff of 1×10−2 was implemented. Subsequently, the aligned sequences were annotated by mapping to annotation files from fRNAdb and sorted according to alignment scores.
The authors wish to acknowledge Bruce Kingham at the University of Delaware for technical assistance. This work was funded in part by National Natural Science Foundation (NSFC) of China award to young scientist (30725008), funding from Shenzhen government (ZYC200903240077A), funding for Shenzhen Key labs (CXB200903110066A), Guangdong Innovation Team Funding, National Basic Research Program of China (973 program, 2007CB815703), NIH 2P20RR016472-10, and NCI SBIR grant (NIH R44CA139977). MRA acknowledges funding from the Danish Agency for Science, Technology and Innovation grant 07-015498.
Author Contributions: B.O.P, J.W, I.F, X.X and Z.C conceived and designed the study. Z.C, Y.G, S.H and K.H.L performed sample preparation and sequencing. X.X, S.P and W.C performed the genome assembly. X.X, S.P, X.L, M.X, W.W, H.N and N.E.L performed genome annotation and evolutionary analysis. H.C.F, J.W, B.P, W.K, N.N and S.R.Q generated data and performed the microfluidic chromosomal analysis. The method and data for chromosome analysis were conceived and generated at Stanford. H.N, N.E.L, M.J.B, W.K and M.R.A performed the genomic and transcriptomic analysis of the glycosylation and viral susceptibility genes. H.N, N.E.L and B.O.P wrote the paper and coordinated research efforts between authors. All authors read and approved the manuscript.