Since the establishment of the HeLa cell line in 1951, it has been used as a model for numerous aspects of human biology with only minimal knowledge of its genomic properties. Here we provide the first detailed characterization of the genomic landscape of one HeLa line relative to the human reference genome. We integrated SNVs, deletions, inversions, tandem duplications, and CN changes along the genome to build a HeLa Kyoto genome. This provides a resource for the community, for instance, to inform primer or RNAi design. In addition, we provide high-resolution RNA-Seq data of the HeLa transcriptome and analyze them based on this cell line’s genome sequence.
We studied the relationship between CN variation and expression. CN is expected to impact gene expression levels in a proportional manner unless dosage compensation occurs (Aït Yahya-Graison et al. 2007; Deng and Disteche 2010). Our results showed that for genes present at the most prevalent CN state of 3, there is no general evidence of allele-specific dosage compensation and that general compensation, if active, is not strong. This finding corroborates observations that Schlattl et al. (2011) have made in lymphoblastoid cell lines assessing polymorphic deletions. A lack of dosage compensation could impact the function of genes in protein complexes, where the stoichiometry of complex members is affected by CN changes.
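To make the proportionality expectation concrete, the following minimal sketch computes the expression expected for a gene at a given CN relative to a two-copy baseline, together with the allelic read fraction expected when every copy is expressed equally. The function names, the `compensation` parameter, and the linear interpolation between "no compensation" and "full compensation" are illustrative assumptions, not the model used in the analysis.

```python
def expected_relative_expression(cn, baseline_cn=2, compensation=0.0):
    """Expected expression relative to a gene at the baseline copy number.

    Assumption (illustrative only): compensation interpolates linearly
    between 0.0 (expression scales proportionally with CN) and
    1.0 (expression stays at the baseline level regardless of CN).
    """
    if cn < 0 or baseline_cn <= 0:
        raise ValueError("copy numbers must be non-negative, baseline positive")
    if not 0.0 <= compensation <= 1.0:
        raise ValueError("compensation must lie in [0, 1]")
    proportional = cn / baseline_cn
    return 1.0 + (1.0 - compensation) * (proportional - 1.0)


def expected_allelic_fraction(allele_copies, total_copies):
    """Expected read fraction for one allele if all copies are expressed equally."""
    if total_copies <= 0 or not 0 <= allele_copies <= total_copies:
        raise ValueError("need 0 <= allele_copies <= total_copies and total_copies > 0")
    return allele_copies / total_copies


# A gene at CN 3 with no compensation is expected at 1.5x the two-copy level.
ratio_cn3 = expected_relative_expression(3)
# With two copies of one allele and one of the other, the duplicated allele
# is expected to contribute two thirds of the reads.
major_fraction = expected_allelic_fraction(2, 3)
```

Under this toy model, observed expression ratios close to 1.5 and allelic fractions close to 2/3 at CN 3 would be consistent with weak or absent compensation, whereas values near 1.0 and 1/2 would point toward compensation.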
We identified approximately 4.5 million SNVs and 0.5 million indels, in addition to ~3000 SVs, including deletions, insertions, and interchromosomal translocations. More than 80% of these SNVs and short indels are most likely common variants segregating in the human population, since they are also present in SNV catalogs such as dbSNP (Sherry et al. 2001) and the 1000 Genomes Project dataset (1000 Genomes Project Consortium et al. 2012). The remaining variants likely comprise rare, tumor-specific, or cell-line-specific variants.
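The catalog comparison described above amounts to a set-membership test. The sketch below shows one way to compute the fraction of called variants found in at least one catalog; it assumes variants are represented as `(chrom, pos, ref, alt)` tuples, and the function name and toy records are hypothetical, not taken from the actual call set.

```python
def fraction_known(called, *catalogs):
    """Fraction of called variants present in at least one catalog.

    Variants are assumed to be hashable (chrom, pos, ref, alt) tuples.
    """
    called = set(called)
    if not called:
        return 0.0
    known = set().union(*catalogs)  # pool all catalogs into one set
    return len(called & known) / len(called)


# Toy data (hypothetical coordinates, for illustration only).
calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr3", 900, "T", "C"),
    ("chr5", 50, "A", "T"),
]
catalog_a = [("chr1", 100, "A", "G"), ("chr2", 400, "G", "A")]
catalog_b = [("chr1", 250, "C", "T"), ("chr3", 900, "T", "C")]

known_fraction = fraction_known(calls, catalog_a, catalog_b)
```

Matching on the full tuple rather than position alone avoids counting a call as known when only the site, but not the alternate allele, is catalogued.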
Our HeLa transcriptome data showed that close to 2000 genes are expressed at levels above the physiological range observed across 16 human tissues. The functions enriched among these genes are related to proliferation, transcription, and DNA repair. The high expression of some DNA repair genes, some of which also carry potentially damaging NS mutations, suggests that even though HeLa displays high chromosomal instability, specific DNA repair mechanisms may be activated, perhaps irrespective of their effectiveness.
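One simple reading of "above the physiological range" is that a gene's expression in the cell line exceeds its maximum across all reference tissues. The sketch below implements that reading only; the exact criterion, normalization, and thresholds used in the analysis are not stated in this section, and the gene identifiers and values are hypothetical.

```python
def genes_above_tissue_range(cell_line_expr, tissue_expr):
    """Return genes whose cell-line expression exceeds the tissue maximum.

    cell_line_expr: dict mapping gene -> expression in the cell line.
    tissue_expr: dict mapping gene -> list of expression values, one per tissue.
    Genes without tissue measurements are skipped.
    """
    above = []
    for gene, value in cell_line_expr.items():
        tissue_values = tissue_expr.get(gene)
        if tissue_values and value > max(tissue_values):
            above.append(gene)
    return sorted(above)


# Toy data (hypothetical gene identifiers and expression values).
cell_line = {"GENE_A": 120.0, "GENE_B": 8.0, "GENE_C": 45.0}
tissues = {
    "GENE_A": [10.0, 35.0, 60.0],
    "GENE_B": [5.0, 9.0, 12.0],
    "GENE_C": [44.0, 40.0, 30.0],
}

outliers = genes_above_tissue_range(cell_line, tissues)
```

In practice, such a comparison would require expression values normalized consistently between the cell line and the tissue panel; that step is omitted here.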
Our analysis is based on shotgun sequencing data of a HeLa cell line at moderate depth. Such data have specific limitations, in particular for phasing of distant variants (i.e., identifying variants co-occurring on a single chromosome) and detection of SVs affecting repetitive regions. These limitations could be overcome by additional data derived from, for example, fosmid libraries, chromosome separation, or large-scale mate pair libraries, although these experiments would be more costly and time-consuming. Here we focused on localized variants that are detectable from shotgun data, which already provide wide-ranging insights into the genomic landscape of HeLa. We expect that in the future, researchers working with cell lines will routinely characterize the genomes of their lines. When the genome of a cell line is unstable, as is the case for HeLa, the characterization might need to be updated regularly. We envisage that approaches similar to the one taken here might help ensure the integrity of cell lines and the quality of the biological insights derived from them.