Genomic stability is critical for the clinical use of human embryonic and induced pluripotent stem cells. We performed high resolution SNP (single nucleotide polymorphism) analysis on 186 pluripotent and 119 non-pluripotent samples. We report a higher frequency of subchromosomal copy number variations in pluripotent samples compared to non-pluripotent samples, with variations enriched in specific genomic regions. The distribution of these variations differed between hESCs and hiPSCs, characterized by large numbers of duplications found in a few hESC samples and moderate numbers of deletions distributed across many hiPSC samples. For hiPSCs, the reprogramming process was associated with deletions of tumor suppressor genes, while time in culture was associated with duplications of oncogenic genes. We also observed duplications that arose during a differentiation protocol. Our results illustrate the dynamic nature of genomic abnormalities in pluripotent stem cells and the need for frequent genomic monitoring to assure phenotypic stability and clinical safety.
The tremendous self-renewal and differentiation capacities of human pluripotent stem cells (hPSCs) make them potential sources of differentiated cells for cell therapy. Cell therapies are subject to rigorous safety trials, and high priority is placed on demonstrating that the cells are non-tumorigenic (Fox, 2008). Since genetic aberrations have been strongly associated with cancers, it is important that preparations destined for clinical use are free from cancer-associated genomic alterations. Human embryonic stem cell (hESC) lines have been shown to become aneuploid in culture (Baker et al., 2007; Draper et al., 2004; Imreh et al., 2006; Maitra et al., 2005; Mitalipova et al., 2005), and the most frequent changes, trisomies of chromosomes 12 and 17, are also characteristic of malignant germ cell tumors (Atkin and Baker, 1982; Rodriguez et al., 1993; Skotheim et al., 2002). Aneuploidies can be detected by karyotyping, but less easily detectable subchromosomal genetic changes may also have adverse effects. Small abnormalities have been detected in hESCs using comparative genomic hybridization (CGH) and single-nucleotide polymorphism (SNP) genotyping (Lefort et al., 2008; Narva et al., 2010; Spits et al., 2008). These studies lacked sufficient resolution and power to identify cell type-associated duplications and deletions. A recent study has reported using gene expression data to detect genomic aberrations in a large number of hESCs and hiPSCs (Mayshar et al., 2010). However, the methods used could only reliably detect relatively large (≥10 megabase) aberrations, and the lack of non-pluripotent samples for comparison precluded the authors from determining which regions of genomic aberration were specific to pluripotent stem cells.
In this study, we performed high-resolution SNP genotyping on a large number of hESC lines, induced pluripotent stem cell lines (hiPSCs), somatic stem cells, primary cells, and tissues. We found that hESC lines had a higher frequency of genomic aberrations compared to the other cell types. Furthermore, we identified regions in the genome that had a greater tendency to be aberrant in the hESCs when compared to the other cell types examined. Recurrent regions of duplication were seen on chromosome 12, encompassing the pluripotency-associated transcription factor NANOG and a nearby NANOG pseudogene, and on chromosome 20, upstream of the DNA methyltransferase DNMT3B. Although the frequency of genomic aberrations seen in the hiPSC lines was similar to that of cultured somatic cells and tissues, we observed one of the recurrent areas of duplication characteristic of hESCs in one of the hiPSC lines.
Furthermore, comparison of 12 hiPSC lines generated from the same primary fibroblast line identified genomic aberrations that were present in the hiPSC lines and absent from the original fibroblast line. Analysis of early and late passage samples from these hiPSC lines allowed us to distinguish between events that arose during the process of reprogramming and those that accumulated during long-term passage. In general, deletions tended to occur with reprogramming and involve tumor suppressor genes, while duplications accumulated with passaging and tended to encompass tumor-promoting genes. These results suggest that human pluripotent stem cell populations are prone to genomic aberrations that could compromise their stability and utility for clinical applications, and that reprogramming and expansion in culture may lead to selection for particular genomic changes.
High-resolution SNP genotyping (1,140,419 SNPs) was performed on 324 samples, including 69 hESC lines (130 samples), 37 hiPSC lines (56 samples), 11 somatic stem cell lines (11 samples), 41 primary cell lines (41 samples), and 20 tissue types (67 samples), as well as samples of differentiated hESC lines and mixtures of known ratios of a sample with a known duplication with a sample without that duplication (Table S1). Copy number variants for all samples were identified in parallel using two algorithms, CNVPartition (Illumina, Inc., Table S2A) and Nexus (Biodiscovery, Inc., Table S2B), both of which have been demonstrated to be appropriate for copy number variation (CNV) identification from SNP genotyping data from Illumina microarrays (Kresse et al., 2010). The concordance between these two algorithms was high (76.08% for deletions, 98.60% for LOH, and 93.04% for duplications on the base-pair level) (Table S2C). A subset of the CNV calls for both algorithms was validated using qRT-PCR. For the CNVPartition calls, 82% of 3-copy gains and 43% of 1-copy losses were confirmed. For Nexus, 68% of Allelic Imbalance, 71% of Copy Number Gain, 47% of Copy Number Loss, and 100% of Loss of Heterozygosity calls were confirmed (Table S3; note that the Allelic Imbalance calls were judged to be correct if the qRT-PCR result indicated either a significant gain or a significant loss). Given the higher accuracy of the duplication calls in CNVPartition, and the ambiguity of the Allelic Imbalance calls in Nexus, CNVPartition was subsequently used as the primary algorithm. CNV calls that overlapped with common CNVs observed in a reference set of 450 HapMap samples (Conrad et al., 2010) were identified and removed from subsequent analyses.
Figure 1 shows a map of the areas of CNV identified in all the samples. Based on validation of the CNV calls by qRT-PCR, which indicated that duplication calls were markedly more accurate than deletion calls, we focused on duplications and large deletions. We inspected the B-allele frequency (BAF) and log R ratio (LRR) plots in order to combine adjacent areas of CNV where appropriate; it is well appreciated that CNV calling algorithms frequently break up large CNV events into multiple calls. For example, the SIVF021 line was shown to have a complete trisomy of chromosome 21 both by preimplantation genetic screening (PGS) of the embryo and karyotyping of the hESC line, but CNVPartition and Nexus both call multiple noncontiguous regions of CNV for this sample on chromosome 21 (Table S2). A list of the regions mapped in Figure 1 is given in Table S4.
Several hESC samples showed duplications of large regions: the BG01 and BG01V samples both showed trisomy 12 and trisomy 17, but only the BG01 sample contained trisomy 3 and a deletion of the long arm of chromosome 7. The MIZ13 sample also contained trisomy 3. SIVF048 had a duplication of chromosome 5, while the WA07P34MNP sample had a deletion of the same chromosome (of note, this sample was from a directed differentiation experiment from hESC to motor neuron progenitor). The FES29 sample had a duplication of the short arm, and a deletion of the long arm, of chromosome 7. Large duplications of chromosomes 12, 17, and 20 were observed in multiple samples. A small region of LOH on chromosome 22 was identified by CNVPartition for the HFIB2IPS5 sample, but inspection of the BAF and LRR plots showed that it was a large region of LOH. In addition, large regions of 2-copy LOH were identified on the X chromosome in several samples. As these samples were male, these calls corresponded to duplications on the X chromosome; duplications of the entire chromosome were identified for the BG01 hESC and the TH1.60OCT4SOX2 hiPSC samples, and a large duplication of the q-arm of the chromosome was found in the BG01V sample. The aneuploidies in SIVF003 (chr16), SIVF011 (chr5), and SIVF021 (chr21) were known prior to derivation from PGS. Aneuploidies and large duplications of chromosomes 1, 12, 17, and X have been previously reported to be common in hESCs (Baker et al., 2007; Draper et al., 2004; Imreh et al., 2006; Mitalipova et al., 2005).
In a recent publication (Narva et al., 2010), complex mosaic aneuploidy was described in one of the lines we genotyped, FES61. In our analysis, the B-allele frequency pattern from the SNP genotyping data indicated that this line contained genetic material from three male individuals (Figure S1), which makes the data from this line uninterpretable for CNV analysis. We therefore excluded this line from further analysis.
In addition to these large duplications and deletions, we observed multiple smaller regions of CNV, including both deletions and duplications, which we examined to identify regions of recurrent CNV in the human pluripotent stem cell samples. As noted above, the validation rate for small duplications was significantly higher than for small deletions, and thus we focused on duplications for our analyses. We ensured that the recurrent regions identified were associated with the pluripotent state rather than a high frequency in the human population by comparing the CNVs found in the hPSC samples with those found in the non-PSC samples, as well as with a dataset identifying common CNVs using 450 HapMap samples (Conrad et al., 2010) (Figure 1 and Table S2).
In order to identify regions of recurrent duplication, we identified regions that were duplicated in multiple samples. Analyzing all samples, and using Fisher's exact test with a p-value cutoff of 0.05, yielded 152 regions where the duplications were distributed at a statistically significantly different rate between pluripotent and non-pluripotent samples (Table S5). We then filtered for regions where the fraction of pluripotent samples was >90%, which yielded 18 regions. The two duplicated segments that fit these criteria were located on chromosome 12 and chromosome 20, and are highlighted in Figure 2. The chromosome 12 region was duplicated in 9 out of 69 hESC lines, with the smallest common duplicated region encompassing NANOGP1 and SLC2A3 (Figure 2A). NANOG itself is upstream of NANOGP1, and was duplicated in 5 lines. The chromosome 20 region was identified in 5 out of 69 hESC lines and 1 out of 37 hiPSC lines. In our manual curation of the data, we identified duplications of this region in two additional samples that CNVPartition failed to detect. For one (WA07P96CMD7), the population was mosaic, and for the other (BG01P67), CNVPartition called duplications of regions flanking the recurrently duplicated region, but missed the region itself. Six of the duplications we mapped included the DNMT3B gene itself (Figure 2B). In two recent publications, recurrent duplications were described in the 20q11.21 region of chromosome 20 in hESCs; these reports indicated that several hESC lines had duplications in a region near the pluripotency-associated gene DNMT3B, which codes for a de novo DNA methyltransferase (Lefort et al., 2008; Spits et al., 2008). Mutations in this region of chromosome 20 have been noted in a number of cancers, suggesting that genetic elements in this region may be associated with hyperproliferation (Guan et al., 1996; Hurst et al., 2004; Koynova et al., 2007; Midorikawa et al., 2006; Scotto et al., 2008; Tanner et al., 1996; Tonon et al., 2005). We also found that 5 out of 69 hESC and 1 out of 37 hiPSC lines had duplications in this region.
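The region-level screen described above (Fisher's exact test at p < 0.05, followed by a >90% pluripotent-fraction filter) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, the two-sided test is implemented directly from the hypergeometric distribution, and any counts passed in are made up by the caller rather than taken from Table S5.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x duplicated samples falling in row 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def is_pluripotent_enriched(dup_psc, n_psc, dup_non, n_non,
                            alpha=0.05, min_frac=0.90):
    """Apply both filters to one region; returns (passes, p_value, psc_fraction)."""
    p = fisher_exact_two_sided(dup_psc, n_psc - dup_psc,
                               dup_non, n_non - dup_non)
    frac = dup_psc / (dup_psc + dup_non)  # share of duplicated samples that are pluripotent
    return (p < alpha and frac > min_frac), p, frac
```

A call such as `is_pluripotent_enriched(10, 186, 0, 119)` uses the pluripotent and non-pluripotent sample totals from the abstract with a hypothetical count of duplicated samples. A region passes only if it clears both the significance cutoff and the pluripotent-fraction cutoff.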
The occurrence of duplications near (but not including) the pluripotency-associated genes NANOG and DNMT3B suggests that the duplication of other genes in these regions is being selected for in the cultures, or that an upstream control element for these genes may be present in the duplicated regions. In several cases, the duplication event was observed in only one of multiple samples from the same cell line collected at different times. In some instances, a more “severe” aberration was present in an earlier passage sample from the same lab (see SIVF019P53 and SIVF019P67 in Figure 2B), again reinforcing the need for detailed records regarding the passage history of cultures.
For comparisons of the relative number and length of CNVs among hESCs, hiPSC, and non-PSCs, we decided to eliminate possible bias due to having multiple samples of some of the cell lines. For such cell lines, we included the one sample that had the largest number of total CNVs in our analysis. In addition, we removed hESC lines where preimplantation genetic diagnosis on the embryo had demonstrated that there was an aneuploidy.
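The de-duplication rule described above (one sample per cell line, chosen as the sample with the most CNVs, after dropping lines with a known embryonic aneuploidy) can be sketched as follows. The record layout and the field names `line`, `n_cnv`, and `pgd_aneuploid` are assumptions made for illustration and do not come from the supplementary tables.

```python
def pick_representative_samples(samples):
    """Return {cell line: sample record}, keeping the sample with the most CNVs.

    `samples` is a list of dicts with the hypothetical keys
    'line', 'sample', 'n_cnv', and 'pgd_aneuploid'.
    """
    best = {}
    for s in samples:
        if s["pgd_aneuploid"]:
            # Skip lines whose embryo was shown to be aneuploid before derivation
            continue
        current = best.get(s["line"])
        if current is None or s["n_cnv"] > current["n_cnv"]:
            best[s["line"]] = s
    return best
```

Ties are resolved in favor of the first sample encountered, which is an arbitrary choice in this sketch.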
Although there was considerable variation in the number of regions of CNV among the samples, overall the average numbers of regions of duplication and deletion were significantly higher in the hiPSCs compared to the non-PSCs (Figure 3). The distribution of genomic aberrations across the hiPSC samples was rather even. In contrast, the distribution among hESC samples was highly skewed, so that although the average number of regions of duplication was not significantly higher in the hESCs than the non-PSCs, it was clear that a subset of hESC samples contained a very large number of duplications (Figure 3).
Not including calls on the X and Y chromosomes (the CNV algorithms call a 1-copy deletion of the X for male samples, and a 0-copy deletion of the Y chromosome for female samples), detected aberrations ranged in size from 0.7-1,791 kb (0-copy deletion), 0.6-12,875 kb (1-copy deletion), and 0.9-6,896 kb (3-copy duplication) (Figure S4A-S4E). The length of 3-copy duplications was higher in hESCs and hiPSCs than in non-PSCs (Wilcoxon Rank Sum Test p-values = 1.42 × 10^-15 and 5.32 × 10^-5, respectively), suggesting that either the incidence of large aberrations is higher in hPSC cultures, there is positive selection for cells with large aberrations in hPSC cultures, or there is negative selection against such cells in non-PSC cultures.
There was no correlation between the number of CNVs detected in the samples and passage number, the quality of the SNP genotyping data as measured by GenomeStudio genotyping call rate, or the Nexus quality score (Figure S4F-S4H). We did not observe a correlation between passage number or passage method and the number of aberrations, even for samples collected from the same cell line (Figure S4I-S4K). There were several very early passage hESC and hiPSC samples with large numbers of genomic aberrations, and the only noted association between passage number and the number of aberrations was in hiPSC lines that were meticulously cultured in a manner that ensured a linear path from samples collected serially during passage. In routine practice, the culture of any given line is highly branched, and investigators frequently do not know the true relationship among the various cryopreserved stocks, frozen nucleic acid samples, and live cultures for any given line. Our observations indicate that it is critical to record not only the passage number, but also the “pedigree,” of each culture, in order to be able to know with certainty that a previous assessment of the genomic stability of a line has any bearing on a current culture of that line. It is important to note that these findings do not exclude the possibility of an effect of culture conditions on genomic stability, but indicate that experiments to assess such an effect must be carefully designed and implemented.
Interestingly, we found a high frequency of duplications in pseudogenes of the pluripotency-associated NANOG and OCT4/POU5F1 genes, including NANOGP1 (Figure 2A). It has been noted that genes active in early embryogenesis, such as OCT4/POU5F1, NANOG, GDF3, and STELLA, tend to have many pseudogenes (Booth and Holland, 2004; Elliman et al., 2006; Liedtke et al., 2007; Pain et al., 2005). NANOG has an unusually large number of pseudogenes (eleven) of which NANOGP1 is the only unprocessed pseudogene, retaining the exon-intron structure of the coding gene. Of the other NANOG pseudogenes, NANOGP4 is in the region of chromosome 7 duplicated in the FES29P39 sample, and NANOGP8 is in the region of chromosome 15 that was duplicated in a subpopulation of the late passage MIZ4P88 line (Figure 4A). NANOGP9 and NANOGP10 are on the X chromosome, and were duplicated in a subpopulation of the late passage UC06P112 sample (Figure 4B). In terms of OCT4/POU5F1 pseudogenes, POU5F1P4 is located on chromosome 1, which was trisomic in the WA07P95 sample, POU5F1P6 is located in a region of chromosome 3 that is duplicated in the SIVF002P17 and the MEL2P13 samples, and POU5F1P3 is located on chromosome 12, which was trisomic in samples from 5 hESC lines (Figure 3). The ESI051P37 sample is interesting, in that it possessed a large deletion that encompasses the POU5F1 and NANOGP3 genes. There is little known about the role that transcribed pseudogenes may play in cellular function. In one report (Hirotsune et al., 2003), a pseudogene was shown to stabilize the transcript of its protein-coding homolog, although its mechanism of action was unclear. It is intriguing to speculate that the pseudogenes of the pluripotency-associated genes may exert positive or negative regulatory influence over these genes.
We observed cases where duplications appeared and took over hESC cultures. In the MIZ4 line, there was evidence that a trisomy of chromosome 15 had arisen in a subpopulation of cells between passage 33 and passage 88 (Figure 4A). In the UC06 line, the subpopulation of cells that had a trisomy of the X chromosome at passage 59 had taken over a larger proportion of the population by passage 112 (Figure 4B). These instances highlighted the need for improved methods for detecting CNVs in mosaic populations of cells. We analyzed mixtures of cells, where we varied the proportion of HDF51IPS11P33 cells, which contain a duplication in chromosome 20, and the parental HDF51 fibroblast line, which is genomically normal in this region. Using CNVPartition, we were able to detect the presence of the duplication only when HDF51IPS11P33 cells made up ≥70% of the mixture. In contrast, calculating the BAF distance allowed the duplication to be detected when ≥20% of the population was affected (Figure 5B and Figure S5A), indicating that improvements in CNV calling algorithms may be possible and would be very useful.
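The relationship between the mosaic fraction and the BAF signal can be illustrated with a simple mixture model. The sketch below is not the calculation used in the study: it assumes that "BAF distance" means the deviation of the heterozygous BAF bands from 0.5, that the aberration is a single-copy gain, and that the array signal is strictly proportional to allele copy number. Under those assumptions, a fraction f of cells carrying the gain shifts heterozygous SNPs to (1 + f)/(2 + f) and 1/(2 + f), so the distance is d = f / (2(2 + f)), which inverts to f = 4d / (1 - 2d).

```python
def expected_baf_distance(fraction):
    """Distance of heterozygous BAF bands from 0.5 for a mosaic one-copy gain."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Mixture of 2-copy (AB) and 3-copy (ABB) cells:
    # B-allele copies = 1 + f, total copies = 2 + f
    return (1.0 + fraction) / (2.0 + fraction) - 0.5


def mosaic_fraction_from_baf_distance(distance):
    """Invert expected_baf_distance: f = 4d / (1 - 2d)."""
    max_distance = 1.0 / 6.0  # every cell carries the gain (bands at 1/3 and 2/3)
    if not 0.0 <= distance <= max_distance + 1e-12:
        raise ValueError("distance outside the range expected for a one-copy gain")
    return 4.0 * distance / (1.0 - 2.0 * distance)
```

Under this model the 20% and 70% mixtures correspond to BAF distances of roughly 0.045 and 0.13. Real data add probe noise and cluster bias, so measured distances should be calibrated against mixtures of known ratios, as was done with the HDF51 mixtures.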
hiPSCs present an ideal system for distinguishing between the effects of reprogramming and passage on genomic stability. They also confer the ability to determine with certainty whether a given alteration is new, since the parental differentiated cells can also be analyzed. Accordingly, we analyzed 3 samples from a primary human fetal fibroblast line, HDF51, and 12 independent hiPSC clones generated from that line. For the hiPSC clones, we collected samples at early (passage 5-8), mid (passage 12-15), and late (passage 25-34) passage, and analyzed at least the early and late passage samples. In addition to identifying duplications using CNVPartition, we identified deletions using a combination of CNVPartition and replicate error detection, which identifies the discrepancies between SNP calls from two samples (Table S6). Since all of the samples originated from the same individual, the replicate error detection represented a way of improving our confidence in our deletion calls. Inspecting the duplication and deletion calls for the HDF51 and HDF51IPS samples (Figure 6), we noticed that all 11 deletions appeared by the earliest passage timepoint, while 5 out of 6 duplications arose during the course of long-term passage. In fact, some of the deletions receded from the population over long-term passage, suggesting that they were positively selected during reprogramming and negatively selected during passage (Figure S6).
Of the 7 duplicated regions that were present in an HDF51IPS line, but absent from the parental HDF51s, 6 contained the coding region and/or promoter region of at least one gene. The overexpression of five of these genes (in red) was positively associated with tumorigenicity or cell proliferation, while for one (FRS2, in green) low expression has been associated with poor prognosis in non-small cell lung cancer (Iejima et al., 2010). BCL2L1 (in orange) has two isoforms, one of which is pro-apoptotic and the other anti-apoptotic (Boise et al., 1993). All 12 deletion regions overlapped at least one gene, and 5 of them contained genes that have evidence of tumor-suppressor activity.
It is notable that the presence of the transduced copies of the reprogramming factors did not confound our analysis by appearing as duplications in the reprogramming genes. This is due to the facts that the transduced genes included only the coding sequences (which have few SNPs), and that to identify a CNV region the CNV-calling algorithms require longer stretches of consecutive SNPs to be affected.
The most rapidly arising genomic aberrations in our dataset were identified in samples from a directed differentiation experiment. Parallel differentiations were performed using WA07 cells at passage 95 and 96, with samples collected from the undifferentiated cells (WA07P95), on differentiation day 2 (WA07P95CMD2 and WA07P96CMD2), and differentiation day 7 (WA07P95CMD7 and WA07P96CMD7). Partial duplications of 3 segments of chromosome 20 were found in the WA07P96CMD7 sample only (Figure 5A and Figure S5B), indicating that they arose between day 2 and day 7 of differentiation. Comparing the BAF plots for WA07P96CMD7 to those from mixtures of known ratios of cells with and without a duplication of a smaller region of chromosome 20 (Figure 5B and Figure S5A), we estimated the percent of cells in the population carrying duplications of the 3 segments to be 20%, 100%, and 50%. This finding points out that differentiation can be a highly selective process, and that genomically aberrant cells can rapidly take over a population undergoing differentiation. We suggest that it is important to assess the genomic normality of cells frequently, not only in the pluripotent state, but also at the endpoint of differentiation experiments or other treatments.
To determine whether the regions of frequent duplication in hESCs might have common features linked to the pluripotent phenotype, we used our large-scale mRNA expression database, which contains gene expression levels for a large number of pluripotent and non-pluripotent cell lines. We found that many of the genes in the recurrently duplicated region on chromosome 12 were more highly expressed in human pluripotent cells compared to multiple non-pluripotent cell types (Figure S2 and S3A). There was not a statistically significant difference in the expression of these genes between the hPSC samples that contained duplications and those that did not. However, this result could have been confounded by the differences in genetic background and culture conditions among the lines.
We therefore examined the expression of genes found within areas of duplication in samples where we had genetically matched controls (Figure S3). There was higher expression of many genes on chromosome 20 in the WA07P96CMD7 samples, which had partial duplications of large stretches of this chromosome (shown in the BAF plot on the lower panel of Figure S3A), compared to the WA07P95CMD7 samples, which were euploid for this chromosome. One of the genes that were most highly affected was DNMT3B, as seen on the panel on the right. We noted that the higher expression was not restricted to the areas involved in the duplications, indicating potential long-range effects of chromosomal aberrations on gene expression. These effects appeared to be weaker, but still present, on other chromosomes (see chromosome 12 panel). We ensured that this effect was not simply due to variations in gene expression between biological replicates by examining the corresponding data for the samples collected at day 0 and day 2 of the same experiment (upper 2 panels of Figure S3A). We also had matched controls for the HDF51IPS lines, and we did see correlation between gene expression and presence of duplications for these samples as well (Figure S3B). These findings suggest that duplications do result in increases in gene expression, both at the site of duplication as well as at distant sites, which can be detected when a genetically matched sample is used for comparison. Even though these gene expression changes are not apparent when comparing samples from unrelated cell lines, we believe this is not relevant, as a cell containing a genomic aberration is competing in culture with a population of otherwise genetically matched cells.
This study represents the most comprehensive and highest resolution study of the genomic stability of hPSCs to date, and includes samples from a large number of both hESCs and hiPSCs, as well as somatic stem cells, primary cell lines, and tissues for comparison. In addition, we analyzed a primary HFF line and 12 hiPSC clones generated from it, collected at early and late passage, which allowed us to distinguish between genomic aberrations that arose during derivation versus long-term culture of hiPSCs.
This study is unique in combining sufficient numbers of both pluripotent and non-pluripotent samples to detect cell-type-specific recurrent genomic aberrations in a statistically significant manner with a high-resolution analysis platform that enables the detection of kilobase-length aberrations. A recently published study using gene expression data to detect genomic aberrations did not include non-pluripotent samples for comparison, and was limited to detection of duplications at least 10 megabases in length (Mayshar et al., 2010). In our results, >90% of duplications in hPSCs and 100% of duplications in non-hPSCs were <10 megabases (Figure S4), indicating that gene expression-based methods are unable to detect small genomic aberrations. Moreover, the genomic locations assigned using gene expression data correspond to the location of the coding sequences of the perturbed genes, rather than the actual genomic coordinates of the genomic aberrations.
The results presented here indicate that hESC lines contain numerous genomic aberrations, most of which would not be detected by karyotyping or other microscopy-based methods. Some regions of CNV occurred multiple times in unrelated hESC and hiPSC lines, suggesting that certain changes may be characteristic of self-renewing pluripotent cells. It should be noted that it was not possible to establish with certainty the stage at which the genomic changes occurred in the hESC samples for which there was not an earlier passage sample demonstrating genomic normality; some of the abnormalities may have been present in the embryos from which the cells were derived. The analysis of hiPSCs does not suffer from this shortcoming, provided that the parental cells collected prior to reprogramming are analyzed. It is also important to consider other differences between hPSCs and cultured somatic cells. In general, because they do not undergo senescence, the hPSC lines had been in continuous culture longer than the primary cell lines, so some of the genetic changes seen may be a function of the selection pressures of cell culture in general, rather than specific to pluripotent stem cell culture.
The relatively high frequency of duplications in hPSCs raises the concern that these genetic aberrations may increase the risk of oncogenesis. The recurrent regions of copy number variation on chromosomes 12 and 20, which lie in close proximity to known pluripotency genes, are particularly worrisome, since a major issue in cell therapy is the elimination of pluripotent precursors in populations destined for transplantation. Three of the 10 duplications on chromosome 12, and 9 of the 10 duplications on chromosome 20, developed over the course of long-term culture of hPSCs, raising the concern that expansion of pluripotent cells may inevitably lead to increased genetic abnormality. However, the NANOG and NANOGP1 duplications were seen in cell lines as early as passage 21 (HUES7), 21 (HUES13), and 28 (HES2), which suggests that low passage number does not in itself ensure genetic integrity. Our data indicate that the pattern of genomic aberrations in hiPSCs and hESCs may differ slightly, but that both cell types are prone to developing such changes, and that one of the two most significant recurrent duplications seen in hESCs, on chromosome 20, was also found in one of the hiPSC lines. The other region of recurrent duplication, encompassing the NANOG/NANOGP1 region of chromosome 12, was detected in a late passage hiPSC line using array CGH (Chin et al., 2009).
Our results and those of others (Lefort et al., 2008; Maitra et al., 2005; Mayshar et al., 2010; Spits et al., 2008; Wu et al., 2008) highlight the need for optimization of derivation and culture conditions that promote genetic stability of pluripotent stem cells. These results also underscore the need to perform further studies that include larger numbers of pluripotent cell lines and careful phenotypic assessments in order to distinguish genetic variations that are harmless from those that pose clinical risks. The evidence for accumulation of genetic aberrations in culture of existing hPSC lines makes it clear that new hPSC lines need to be generated now and on a continuing basis, and emphasizes the necessity of frequent assessments of genomic stability in hPSC lines, both in the pluripotent state and when the cells are subjected to other potentially selective conditions, such as differentiation procedures.
All cell types were derived and propagated as described in the references listed in Table S1.
Genomic DNA was purified using the DNeasy Blood & Tissue Kit (Qiagen).
SNP genotyping was performed on the Illumina OmniQuad version 1, which interrogates 1,140,419 SNPs across the human genome. One microgram of input genomic DNA (the yield from approximately 200,000 cells) was amplified and labeled according to the manufacturer's instructions. The DNA was quantified using the PicoGreen reagent (Invitrogen, Inc.). The labeled product was then hybridized to the array and scanned on a BeadArray Reader (Illumina, Inc.). Genotyping calls were made using BeadStudio (Illumina, Inc.), using the standard cluster files provided by the manufacturer. The GenCall threshold was set to 0.15, and the call rates were between 0.979 and 0.999.
For the SNP genotyping data, data preprocessing was performed in BeadScan (Illumina, Inc.). Data cleaning, SNP calling, and replicate error identification were performed in GenomeStudio (Illumina, Inc.). CNVPartition v2.4.4 (Illumina, 2008) was used as the primary CNV-calling algorithm for the results presented in this paper. CNV regions were also identified using the SNPRank Segmentation algorithm in Nexus (Biodiscovery, Inc.) to assess concordance between the two methods. The CNVPartition CNV score threshold was set at 50, with a minimum number of SNPs per CNV region of 10. The Nexus parameters included a significance threshold of 1 × 10^-8 and a minimum number of probes per segment of 10.
We chose to remove data from probes on the array that were designated as “CNV” probes prior to using CNVPartition and Nexus. We did this for two reasons: first, the CNV probes were designed as monoallelic probes, and hence provide no B-allele-frequency information, potentially reducing their accuracy in calling duplications and deletions; second, we were interested in detecting genomic aberrations that occurred with derivation and passage of cell lines (and potentially with tissue-specific differentiation), rather than CNVs that vary among individuals and are carried in the germline, which are the ones targeted by the CNV probes.
Since the average spacing of SNPs on the SNP genotyping microarrays used was 3 kb and at least 10 SNPs were required per CNV region, the shortest detectable CNV regions were expected to be approximately 30 kb. CNVPartition and Nexus generally identified similar regions of duplication and deletion (97% agreement at the individual SNP level for 3-copy duplications, and 99% for 1-copy deletions) (Table S1 and Table S2).
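The size estimate follows from multiplying the average probe spacing by the minimum number of SNPs per region. The sketch below reproduces that arithmetic and shows one possible way to compute a SNP-level agreement between two callers. The agreement definition used here (SNPs flagged by both callers divided by SNPs flagged by either) is an assumption, since the text does not state how the percentages were computed, and the SNP identifiers are hypothetical.

```python
def min_detectable_cnv_bp(mean_spacing_bp: int, min_snps: int) -> int:
    """Approximate smallest callable region: spacing times required SNP count."""
    return mean_spacing_bp * min_snps


def snp_level_agreement(flagged_a: set, flagged_b: set) -> float:
    """Assumed definition: SNPs flagged by both callers / SNPs flagged by either."""
    union = flagged_a | flagged_b
    if not union:
        return 1.0  # neither caller flagged anything, so they trivially agree
    return len(flagged_a & flagged_b) / len(union)


smallest = min_detectable_cnv_bp(mean_spacing_bp=3_000, min_snps=10)

# Hypothetical SNP identifiers flagged as 3-copy by each caller.
caller_a = {"snp1", "snp2", "snp3"}
caller_b = {"snp2", "snp3", "snp4"}
agreement = snp_level_agreement(caller_a, caller_b)
```

With a 3 kb spacing and 10 SNPs the product is 30,000 bp, matching the approximate 30 kb stated above.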
A CNV called by CNVPartition in our dataset was considered to overlap a common CNV observed in 450 reference HapMap samples (Conrad et al., 2010) when the shared region exceeded both 20% of the CNV identified in our samples and 20% of the common CNV in the reference set. The CNVs in our dataset that overlap with common CNVs are indicated by an asterisk in Figure 1 and were also removed from subsequent analyses.
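The two-sided 20% criterion can be sketched as follows. This is a minimal illustration assuming both CNVs lie on the same chromosome and are given as half-open base-pair intervals; the function names and coordinates are hypothetical.

```python
def shared_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the region common to two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlaps_common_cnv(sample_cnv, common_cnv, min_fraction=0.20) -> bool:
    """True when the shared region exceeds min_fraction of BOTH CNVs."""
    s_start, s_end = sample_cnv
    c_start, c_end = common_cnv
    shared = shared_bp(s_start, s_end, c_start, c_end)
    return (shared > min_fraction * (s_end - s_start)
            and shared > min_fraction * (c_end - c_start))


# Shared region is 50% of each CNV: counts as overlapping.
half_shared = overlaps_common_cnv((1_000, 2_000), (1_500, 2_500))
# Shared region is 100% of the common CNV but only 10% of the sample CNV.
small_inside_large = overlaps_common_cnv((0, 10_000), (0, 1_000))
```

Requiring the threshold on both sides means that a small common CNV lying inside a much larger sample CNV does not cause the larger call to be discarded.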
CNV calls from CNVPartition and Nexus were validated by performing qRT-PCR for a subset of the CNV calls. TaqMan CNV assays (Life Technologies, Inc.) were performed according to the manufacturer's instructions. Assays were performed in triplicate, with the HDF51IPS1P25 sample used as the reference. The predicted copy number (CN) was calculated using the equation: CN = 2 × 2^(−ΔΔCt).
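A worked example of the copy number equation is sketched below. It assumes the usual ΔΔCt convention, in which ΔCt is the mean Ct of the target assay minus the mean Ct of an endogenous control assay, and ΔΔCt is the ΔCt of the test sample minus the ΔCt of the reference sample; the text does not spell out these steps. The Ct values are invented, and the function names are hypothetical.

```python
from statistics import mean


def delta_ct(target_cts, control_cts) -> float:
    """Mean Ct of the target assay minus mean Ct of the endogenous control."""
    return mean(target_cts) - mean(control_cts)


def predicted_copy_number(sample_dct: float, reference_dct: float,
                          reference_copies: int = 2) -> float:
    """CN = 2 * 2^(-ddCt), where the factor 2 is the reference copy number."""
    ddct = sample_dct - reference_dct
    return reference_copies * 2 ** (-ddct)


# Invented triplicate Ct values for illustration only.
control = [24.0, 24.5, 23.5]
reference_dct = delta_ct([25.0, 25.5, 24.5], control)    # reference sample
duplicated_dct = delta_ct([24.0, 24.5, 23.5], control)   # target one cycle earlier
deleted_dct = delta_ct([26.0, 26.5, 25.5], control)      # target one cycle later

cn_reference = predicted_copy_number(reference_dct, reference_dct)
cn_earlier = predicted_copy_number(duplicated_dct, reference_dct)
cn_later = predicted_copy_number(deleted_dct, reference_dct)
```

Each full cycle of difference corresponds to a two-fold change, so a target that amplifies one cycle earlier than in the reference yields a predicted copy number of 4 and one cycle later yields 1; a 3-copy duplication corresponds to a ΔΔCt of about −0.58.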
Since replicate errors could only be identified where samples were derived from the same original cell line, replicate error calling was performed only for the HDF51-derived lines. For these samples, SNP calls were validated by performing qPCR for a subset of the loci where replicate errors were called. TaqMan SNP assays (Life Technologies, Inc.) were performed according to the manufacturer's instructions. The HDF51P11 sample was used as the reference. There were 8 homozygous-to-homozygous replicate errors identified (0/4 tested were confirmed), 313 homozygous-to-heterozygous replicate errors (0/14 tested were confirmed), and 310 heterozygous-to-homozygous replicate errors (11/11 tested were confirmed) (Table S3). These results indicate that the large majority of apparent SNP mutations identified by replicate error analysis are in fact due to SNP genotyping error; this is not unexpected given reports that the discrepancy between SNP calls made by sequencing and by microarray-based SNP genotyping is ~0.05%-0.1% (Bentley et al., 2008). Based on the average proportions of heterozygous and homozygous SNPs in the SNP genotyping data (~20% heterozygous and 80% homozygous), we would have expected an excess of homozygous-to-heterozygous replicate error calls. The larger than expected number of heterozygous-to-homozygous calls arose because deletions and some duplications (when the cluster separation is poor) appear to result in replicate error calls; this is also why heterozygous-to-homozygous replicate error calls are expected to validate better.
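The three categories of replicate error can be illustrated with a simple comparison of genotype calls between the reference sample and a derived sample. The sketch below is an illustration of the categories only, not the GenomeStudio implementation; it assumes biallelic genotypes written as `AA`, `AB`, or `BB`, uses a hypothetical `NC` code for no-calls, and the example genotypes are invented.

```python
from collections import Counter

NO_CALL = "NC"  # hypothetical no-call code


def is_heterozygous(genotype: str) -> bool:
    return genotype[0] != genotype[1]


def classify_replicate_error(reference: str, derived: str):
    """Return the discordance category, or None if the calls agree or are missing."""
    if NO_CALL in (reference, derived) or reference == derived:
        return None
    ref_het = is_heterozygous(reference)
    der_het = is_heterozygous(derived)
    if not ref_het and not der_het:
        return "hom_to_hom"   # e.g. AA -> BB
    if not ref_het and der_het:
        return "hom_to_het"   # e.g. AA -> AB
    if ref_het and not der_het:
        return "het_to_hom"   # e.g. AB -> BB, as expected over a deletion
    return None               # both heterozygous: no change in zygosity class


# Invented (reference, derived) genotype pairs for illustration only.
pairs = [("AA", "AA"), ("AA", "BB"), ("AA", "AB"),
         ("AB", "BB"), ("AB", "AA"), ("AB", "NC")]
counts = Counter(c for c in (classify_replicate_error(r, d) for r, d in pairs) if c)
```

A one-copy deletion spanning heterozygous SNPs would be expected to produce a run of `het_to_hom` calls, which is consistent with this category validating well.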
For intervals of interest, homozygous SNPs were removed by eliminating SNPs with BAF values >0.8 or <0.2. The heterozygous SNPs were separated into two clusters, using the median BAF value of the heterozygous SNPs as a cutoff. The “AAB” cluster had BAF values < median BAF, and the “ABB” cluster had BAF values > median BAF. The difference between the mean BAF for the AAB cluster and the mean BAF for the ABB cluster was the BAF distance.
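The BAF distance calculation described above can be sketched in a few lines of Python. The function name and example values are hypothetical. Reporting the distance as the ABB mean minus the AAB mean (so that it is non-negative), excluding SNPs whose BAF equals the median from both clusters, and returning `None` when a cluster is empty are choices made for this sketch.

```python
from statistics import mean, median


def baf_distance(baf_values, low=0.2, high=0.8):
    """Distance between the mean BAFs of the two heterozygous clusters."""
    # Remove homozygous SNPs (BAF > 0.8 or < 0.2), as described in the text.
    het = [b for b in baf_values if low <= b <= high]
    if not het:
        return None
    cutoff = median(het)
    aab = [b for b in het if b < cutoff]  # "AAB" cluster, below the median
    abb = [b for b in het if b > cutoff]  # "ABB" cluster, above the median
    if not aab or not abb:
        return None
    return mean(abb) - mean(aab)


# Invented BAF values for an interval resembling a 3-copy duplication.
duplicated = baf_distance([0.0, 1.0, 0.33, 0.35, 0.65, 0.67, 0.05, 0.95])
# Invented BAF values for an interval resembling a normal 2-copy state.
normal = baf_distance([0.48, 0.49, 0.50, 0.51, 0.52, 0.0, 1.0])
```

In a normal two-copy region the heterozygous SNPs cluster near 0.5 and the distance is small, whereas in a duplicated region the clusters move toward roughly 0.33 and 0.67 and the distance grows.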
We would like to acknowledge all of the collaborators who contributed samples to this study, including Eirini Papapetrou (Sadelain lab), Dongbao Chen, Ralph Graichen, Jerold Chun, Martin Pera, James Shen, Scott McKercher, Timo Otonkoski, Sheng Ding, and Vasiliy Galat. We would like to thank Gulsah Altun for invaluable assistance. We would like to thank the NICHD Brain and Tissue Bank for Developmental Disorders, Planned Parenthood of San Diego and Riverside Counties, and Christopher Barry for generously providing tissue specimens for this study. LCL was supported by an NIH/NICHD K12 Career Development Award and the Hartwell Foundation. JFL, IS, HT, CL, and FJM are supported by CIRM (CL1-00502, RT1-01108, TR1-01250, RN2-00931-1), NIH (R21MH087925), the Millipore Foundation, and the Esther O'Keefe Foundation. IU was supported in part by a fellowship from the Edmond J. Safra foundation in Tel Aviv University and by the Legacy stem cell research fund. IL was supported by the PEW Charitable Trust. HSP and SL were supported by a SCRC Grant (SC2250) of the 21st Century Frontier Research Program funded by the Ministry of Education, Science and Technology. MJB was partially supported by grants RYC-2007-01510 and SAF2009-08588 from the Ministerio de Ciencia e Innovación of Spain. Work in the laboratory of JCIB was supported by grants from MICINN Fundacion Cellex, the G. Harold and Leila Y. Mathers Charitable Foundation and Sanofi-Aventis. CM was supported by NIH grants R01 HL64387, P01 HL094374, R01 HL084642, and P01 GM081719. VG was partially supported by NHLBI, RC1HL100168. RS was supported in part by the Israel Science Foundation (grant no. 802/08). ALL was supported by grants from the Australian Stem Cell Centre and from the Victoria-California Stem Cell Alliance (TR101250) between CIRM and the state government of Victoria, Australia.