High-resolution SNP genotyping (1,140,419 SNPs) was performed on 324 samples, including 69 hESC lines (130 samples), 37 hiPSC lines (56 samples), 11 somatic stem cell lines (11 samples), 41 primary cell lines (41 samples), and 20 tissue types (67 samples), as well as samples of differentiated hESC lines and mixtures of known ratios of a sample with a known duplication with a sample without that duplication (
Table S1). Copy number variants for all samples were identified in parallel using two algorithms, CNVPartition (Illumina, Inc.,
Table S2A) and Nexus (Biodiscovery, Inc.,
Table S2B), both of which have been demonstrated to be appropriate for copy number variation (CNV) identification from SNP Genotyping data from Illumina microarrays (
Kresse et al., 2010). The concordance between these two algorithms was high (76.08% for deletions, 98.60% for LOH, and 93.04% for duplications on the base-pair level) (
Table S2C). A subset of the CNV calls for both algorithms were validated using qRT-PCR. For the CNVPartition calls, 82% of 3-copy gains and 43% of 1-copy losses were confirmed. For Nexus, 68% of Allelic Imbalance, 71% of Copy Number Gain, 47% of Copy Number Loss, and 100% of Loss of Heterozygosity calls were confirmed (
Table S3, note that the Allelic Imbalance calls were judged to be correct if the qRT-PCR result indicated either a significant gain or a significant loss). Given the higher accuracy of the duplication calls in CNVPartition, and the ambiguity of the Allelic Imbalance calls in Nexus, CNVPartition was subsequently used as the primary algorithm. CNV calls that overlapped with common CNVs observed in a reference set of 450 HapMap samples (
Conrad et al., 2010) were identified and removed from subsequent analyses.
shows a map of the areas of CNV identified in all the samples. Based on validation of the CNV calls by qRT-PCR, which indicated that duplication calls were markedly more accurate than deletion calls, we focused on duplications and large deletions. We inspected the B-allele frequency (BAF) and log R ratio (LRR) plots in order to combine adjacent areas of CNV where appropriate; it is well appreciated that CNV calling algorithms frequently break up large CNV events into multiple calls. For example, the SIVF021 line was shown to have a complete trisomy of chromosome 21 both by prenatal genetic screening (PGS) of the embryo and karyotyping of the hESC line, but CNVPartition and Nexus both call multiple noncontiguous regions of CNV for this sample on chromosome 21 (
Table S2). A list of the regions mapped in is given in
Table S4.
Large regions of CNV in hESCs and hiPSCs
Several hESC samples showed duplications of large regions: the BG01 and BG01V samples both showed trisomy 12 and trisomy 17, but only the BG01 sample contained trisomy 3 and a deletion of the long arm of chromosome 7. The MIZ13 sample also contained trisomy 3. SIVF048 had a duplication of chromosome 5, while the WA07P34MNP sample had a deletion of the same chromosome (of note, this sample was from a directed differentiation experiment from hESC to motor neuron progenitor). The FES29 sample had a duplication of the short arm, and a deletion of the long arm, of chromosome 7. Large duplications of chromosomes 12, 17, and 20 were observed in multiple samples. A small region of LOH on chromosome 22 was identified by CNVPartition for the HFIB2IPS5 sample, but inspection of the BAF and LRR plots showed that it was a large region of LOH. In addition, large regions of 2-copy LOH were identified on the X chromosome in several samples. As these samples were male, these calls corresponded to duplications on the X chromosome; duplications of the entire chromosome were identified for the BG01 hESC and the TH1.60OCT4SOX2 hiPSC samples, and a large duplication of the q-arm of the chromosome was found in the BG01V sample. The aneuploidies in SIVF003 (chr16), SIVF011 (chr5), and SIVF021 (chr21) were known prior to derivation from PGS. Aneuploidies and large duplications of chromosomes 1, 12, 17, and X have been previously reported to be common in hESCs (
Baker et al., 2007;
Draper et al., 2004;
Imreh et al., 2006;
Mitalipova et al., 2005).
In a recent publication (
Narva et al., 2010), complex mosaic aneuploidy was described in one of the lines we genotyped, FES61. In our analysis, the B-allele frequency pattern from the SNP genotyping data indicated that this line contained genetic material from three male individuals (
Figure S1), which makes the data from this line uninterpretable for CNV analysis. We therefore excluded this line from further analysis.
Recurrent regions of CNV in hESCs and hiPSCs
In addition to these large duplications and deletions, we observed multiple smaller regions of CNV, including both deletions and duplications, which we examined to identify regions of recurrent CNV in the human pluripotent stem cell samples. As noted above, the validation rate for small duplications was significantly higher than for small deletions, and thus we focused on duplications for our analyses. We ensured that the recurrent regions identified were associated with the pluripotent state rather than a high frequency in the human population by comparing the CNVs found in the hPSC samples with those found in the non-PSC samples, as well as a dataset identifying common CNVs using 450 HapMap samples (
Conrad, Pinto, et al. 2010) ( and
Table S2).
In order to identify regions of recurrent duplication, we identified regions that were duplicated in multiple samples. Analyzing all samples, and using Fisher's exact test with a p-value cutoff of 0.05, yielded 152 regions where the duplications were distributed at a statistically significantly different rate between pluripotent and non-pluripotent samples (
Table S5). We then filtered for regions where the fraction of pluripotent samples was >90%, which yielded 18 regions. The two duplicated segments that fit these criteria were located on chromosome 12 and chromosome 20, and are highlighted in . The chromosome 12 region was duplicated in 9 out of 69 hESC lines, with the smallest common duplicated region encompassing NANOGP1 and SLC2A3 (). NANOG itself is upstream of NANOGP1, and was duplicated in 5 lines. The chromosome 20 region was identified in 7 out of 69 hESC lines and 1 out of 37 hiPSC lines. In our manual curation of the data, we identified duplications of this region in two additional samples, that CNVPartition failed to detect. For one (WA07P96CMD7), the population was mosaic and for the other (BG01P67), CNVPartition called duplications of regions flanking the recurrently duplication region, but missed the region itself. Six of the duplications we mapped included the DNMT3B gene itself (). In two recent publications, recurrent duplications were described in the 20q11.21 region of chromosome 20 in hESCs; these reports indicated that several hESC lines had duplications in a region near the pluripotency-associated gene DNMT3B, which codes for a
de novo DNA methyltransferase (
Lefort et al., 2008;
Spits et al., 2008). Mutations in this region of chromosome 20 have been noted in a number of cancers, suggesting that genetic elements in this region may be associated with hyperproliferation (
Guan et al., 1996;
Hurst et al., 2004;
Koynova et al., 2007;
Midorikawa et al., 2006;
Scotto et al., 2008;
Tanner et al., 1996;
Tonon et al., 2005). We also found that 5 out of 69 hESC and 1 out of 37 hiPSC lines had duplications in this region.
The occurrence of duplications near (but not including) the pluripotency-associated genes NANOG and DNMT3B suggests that the duplication of other genes in these regions are being selected for in the cultures, or that an upstream control element for these genes may be present in the duplicated regions. In several cases, the duplication event was observed in only one of multiple samples from the same cell line collected at different times. In some instances, a more “severe” aberration was present in an earlier passage sample from the same lab (see SIVF019P53 and SIVF019P67 in ), again reinforcing the need for detailed records regarding the passage history of cultures.
Comparison of CNVs in hESCs, hiPSCs, and Non-PSCs
For comparisons of the relative number and length of CNVs among hESCs, hiPSC, and non-PSCs, we decided to eliminate possible bias due to having multiple samples of some of the cell lines. For such cell lines, we included the one sample that had the largest number of total CNVs in our analysis. In addition, we removed hESC lines where preimplantation genetic diagnosis on the embryo had demonstrated that there was an aneuploidy.
Although there was considerable variation in the number of regions of CNV among the samples, overall the average numbers of regions of duplication and deletion were significantly higher in the hiPSCs compared to the non-PSCs (). The distribution of genomic aberrations across the hiPSC samples was rather even. In contrast, the distribution among hESC samples was highly skewed, so that although the average number of regions of duplication was not significantly higher in the hESCs than the non-PSCs, it was clear that a subset of hESC samples contained a very large number of duplications ().
Not including calls on the X and Y chromosomes (the CNV algorithms call a 1-copy deletion of the X for male samples, and a 0-copy deletion of the Y chromosome for female samples), detected aberrations ranged in size from 0.7-1,791 kb (0-copy deletion), 0.6-12,875 kb (1-copy deletion), and 0.9-6,896 kb (3-copy duplication) (
Figure S4A-S4E). The length of 3-copy duplications was higher in hESCs and hiPSCs than in non-PSCs (Wilcoxon Rank Sum Test p-values = 1.42 × 10
-15 and 5.32 × 10
-5, respectively), suggesting that either the incidence of large aberrations is higher in hPSC cultures, there is positive selection for cells with large aberrations in hPSC cultures, or there is negative selection against such cells in non-PSC cultures.
Correlation between CNVs and data quality or culture parameters
There was no correlation between the number of CNVs detected in the samples and passage number, the quality of the SNP genotyping data as measured by GenomeStudio genotyping call rate, or the Nexus quality score (
Figure S4F-S42H). We did not observe a correlation between passage number or passage method and the number of aberrations, even for samples collected from the same cell line (
Figure S4I-S4K). There were several very early passage hESC and hiPSC samples with large numbers of genomic aberrations, and the only noted association between passage number and the number of aberrations was in hiPSC lines that were meticulously cultured in a manner that ensured a linear path from samples collected serially during passage. In routine practice, the culture of any given line is highly branched, and investigators frequently do not know the true relationship among the various cryopreserved stocks, frozen nucleic acid samples, and live cultures for any given line. Our observations indicate that it is critical not only to record the passage number, but also the “pedigree,” of each culture, in order to be able to know with certainty that a previous assessment of the genomic stability of a line has any bearing on a current culture of that line. It is important to note that these findings do not exclude the possibility of an effect of culture conditions on genomic stability, but indicate that experiments to assess such an effect must be carefully designed and implemented.
Duplications of pseudogenes of pluripotency-associated genes
Interestingly, we found a high frequency of duplications in pseudogenes of the pluripotency-associated NANOG and OCT4/POU5F1 genes, including NANOGP1 (). It has been noted that genes active in early embryogenesis, such as OCT4/POU5F1, NANOG, GDF3, and STELLA, tend to have many pseudogenes (
Booth and Holland, 2004;
Elliman et al., 2006;
Liedtke et al., 2007;
Pain et al., 2005). NANOG has an unusually large number of pseudogenes (eleven) of which NANOGP1 is the only unprocessed pseudogene, retaining the exon-intron structure of the coding gene. Of the other NANOG pseudogenes, NANOGP4 is in the region of chromosome 7 duplicated in the FES29P39 sample, and NANOGP8 is in the region of chromosome 15 that was duplicated in a subpopulation of the late passage MIZ4P88 line (). NANOGP9 and NANOGP10 are on the X chromosome, and were duplicated in a subpopulation of the late passage UC06P112 sample (). In terms of OCT4/POU5F1 pseudogenes, POU5F1P4 is located on chromosome 1, which was trisomic in the WA07P95 sample, POU5F1P6 is located in a region of chromosome 3 that is duplicated in the SIVF002P17 and the MEL2P13 samples, and POU5F1P3 is located on chromosome 12, which was trisomic in samples from 5 hESC lines (). The ESI051P37 sample is interesting, in that it possessed a large deletion that encompasses the POU5F1 and NANOGP3 genes. There is little known about the role that transcribed pseudogenes may play in cellular function. In one report (
Hirotsune et al., 2003), a pseudogene was shown to stabilize the transcript of its protein-coding homolog, although its mechanism of action was unclear. It is intriguing to speculate that the pseudogenes of the pluripotency-associated genes may exert positive or negative regulatory influence over these genes.
Dynamic changes in genomic structure in hPSC populations
We observed cases where duplications appeared and took over hESC cultures. In the MIZ4 line, there was evidence that a trisomy of chromosome 15 had arisen in a subpopulation of cells between passage 33 and passage 88 (). In the UC06 line, the subpopulation of cells that had a trisomy of the X chromosome at passage 59 had taken over a larger proportion of the population by passage 112 (). These instances highlighted the need for improved methods for detecting CNVs in mosaic populations of cells. We analyzed mixtures of cells, where we varied the proportion of HDF51IPS11P33 cells, which contain a duplication in chromosome 20, and the parental HDF51 fibroblast line, which is genomically normal in this region. Using CNVPartition, we were able to detect the presence of the duplication when the percentage of HDF51IPS11P33 cells was ≥70% of the cells. However, calculating BAF distance can be used to detect the presence of the duplication when ≥20% of the population is affected (
and Figure S5A), indicating that improvements in CNV calling algorithms may be possible and would be very useful.
Genomic aberrations during reprogramming and passage of hiPSCs
hiPSCs present an ideal system for distinguishing between the effects of reprogramming and passage on genomic stability. They also confer the ability to determine with certainty whether a given alteration is new, since the parental differentiated cells can also be analyzed. Accordingly, we analyzed 3 samples from a primary human fetal fibroblast line, HDF51, and 12 independent hiPSC clones generated from that line. For the hiPSC clones, we collected samples at early (passage 5-8), mid (passage 12-15), and late (passage 25-34) passage, and analyzed at least the early and late passage samples. In addition to identifying duplications using CNVPartition, we identified deletions using a combination of CNVPartition and replicate error detection, which identifies the discrepancies between SNP calls from two samples (
Table S6). Since all of the samples originated from the same individual, the replicate error detection represented a way of improving our confidence in our deletion calls. Inspecting the duplication and deletion calls for the HDF51 and HDF51IPS samples (), we noticed that all 11 deletions appeared by the earliest passage timepoint, while 5 out of 6 duplications arose during the course of long-term passage. In fact, some of the deletions receded from the population over long-term passage, suggesting that they were positively selected during reprogramming and negatively selected during passage (
Figure S6).
Of the 7 duplicated regions that were present in an HDF51IPS line, but absent from the parental HDF51s, 6 contained the coding region and/or promoter region of at least one gene. The overexpression of five of these genes (in red) were positively associated with tumorigenicity or cell proliferation, while for one (FRS2, in green) low expression has been associated with poor prognosis in non-small cell lung cancer (
Iejima et al., 2010). BCL2L1 (in orange) has two isoforms, one of which is pro-apoptotic and the other is anti-apoptotic (
Boise et al., 1993). All 12 deletion regions overlapped at least one gene, and 5 of them contained genes that have evidence of tumor-suppressor activity.
It is notable that the presence of the transduced copies of the reprogramming factors did not confound our analysis by appearing as duplications in the reprogramming genes. This is due to the facts that the transduced genes included only the coding sequences (which have few SNPs), and that to identify a CNV region the CNV-calling algorithms require longer stretches of consecutive SNPs to be affected.
Genomic aberrations arising during differentiation
The most rapidly arising genomic aberrations in our dataset were identified in samples from a directed differentiation experiment. Parallel differentiations were performed using WA07 cells at passage 95 and 96, with samples collected from the undifferentiated cells (WA07P95), on differentiation day 2 (WA07P95CMD2 and WA07P96CMD2), and differentiation day 7 (WA07P95CMD7 and WA07P96CMD7). Partial duplications of 3 segments of chromosome 20 were found in the WA07P96CMD7 sample only (
and Figure S5B), indicating that they arose between day 2 and day 7 of differentiation. Comparing the BAF plots for WA07P96CMD7 to those from mixtures of known ratios of cells with and without a duplication of a smaller region of chromosome 20 (
and Figure S5A), we estimated the percent of cells in the population carrying duplications of the 3 segments to be 20%, 100%, and 50%. This finding points out that differentiation can be a highly selective process, and that genomically aberrant cells can rapidly take over a population undergoing differentiation. We suggest that it is important to assess the genomic normality of cells frequently, not only in the pluripotent state, but also at the endpoint of differentiation experiments or other treatments.
Correlations between genomic aberrations and gene expression
To determine whether the regions of frequent duplication in hESCs might have common features linked to the pluripotent phenotype, we used our large-scale mRNA expression database, which contains gene expression levels for a large number of pluripotent and non-pluripotent cell lines. We found that many of the genes in the recurrently duplicated region on chromosome 12 were more highly expressed in human pluripotent cells compared to multiple non-pluripotent cell types (
Figure S2 and S3A). There was not a statistically significant difference in the expression of these genes between in the hPSC samples that contained duplications and those that did not. However, this result could have been confounded by the differences in genetic background and culture conditions among the lines.
We therefore examined the expression of genes found within areas of duplication in samples where we had genetically matched controls (
Figure S3). There was higher expression of many genes on chromosome 20 in the WA07P96CMD7 samples, which had partial duplications of large stretches of this chromosome (shown in the BAF plot on the lower panel of
Figure S3A), compared to the WA07P95CMD7 samples, which were euploid for this chromosome. One of the genes that were most highly affected was DNMT3B, as seen on the panel on the right. We noted that the higher expression was not restricted to the areas involved in the duplications, indicating potential long-range effects of chromosomal aberrations on gene expression. These effects appeared to be weaker, but still present, on other chromosomes (see chromosome 12 panel). We ensured that this effect was not simply due to variations in gene expression between biological replicates by examining the corresponding data for the samples collected at day 0 and day 2 of the same experiment (upper 2 panels of
Figure S3A). We also had matched controls for the HDF51IPS lines, and we did see correlation between gene expression and presence of duplications for these samples as well (
Figure S3B). These findings suggest that duplications do result in increases in gene expression, both at the site of duplication as well as at distant sites, which can be detected when a genetically matched sample is used for comparison. Even though these gene expression changes are not apparent when comparing samples from unrelated cell lines, we believe this is not relevant, as a cell containing a genomic aberration is competing in culture with a population of otherwise genetically matched cells.