Overall design and concept
To develop an effective bioinformatics strategy for CNVs detection at the single cell level through massively parallel sequencing, we generated sequencing data using single cells isolated from peripheral blood (PB) or blastocysts. The single cells received a standard degenerate oligonucleotide primer PCR (DOP-PCR) and Illumina sequencing at BGI-Shenzhen. After a short read alignment, we first tried to describe the WGA-induced bias. Considering the influence of the unique differences in the human reference genome sequence, windows selection was performed and well-selected windows were employed as the minimum observation unit. We described the WGA-induced bias using a relative reads number (RRN) and a newly defined statistic, the GC-bias index (
). Based on this significant discovery, we developed a weighted correction strategy to lessen the influence of WGA-induced bias. Finally, a bioinformatics pipeline, which included GC correction for WGA-induced bias removal, a binary segmentation algorithm for CNV breakpoints identification, and dynamic threshold determination for a final signal filtering, was established to access more accurate CNVs.
Sample recruitment, sample processing and single cell isolation
A total of nine single cell samples from eight individuals were collected at BGI-Shenzhen, CITIC Xiangya Reproductive & Genetic Hospital, and Nanjing Maternal and Child Health Hospital, including two from PB of YH with normal karyotype, one from PB of an individual with Down's syndrome, one from PB of an individual with Cri-du-Chat (CDC) Syndrome, two from PB of an individual with 1p duplication and 20q duplication, and three trophectoderm cell samples from blastocysts. All those samples were karyotyped by G-banding or FISH in local hospitals before this study.
Each participant was recruited with informed written consent and approval of the Institutional Review Board of BGI-Shenzhen, for those trophectoderm samples the parents were also provided informed consent under the protocol approved by the Ethics Committee of the CITIC Xiangya Reproductive & Genetic Hospital. For three of the participants less than five years old, informed written consent was obtained from their guardians. All the potential participants or their guardian were provided with the detailed information of this study, including the benefits and risks of our technology development in written information consent form, and also were informed of the rights to privacy. The results of this study would not feedback to the participants or their guardian; therefore all potential participants or their guardian who declined to participate or otherwise did not participate were eligible for treatment and were not disadvantaged in any other way by not participating in this study.
5–10 mL of PB was collected in EDTA-coated tubes at BGI. 10 ul blood samples for single cell isolation were frozen in −20°C. The blood was thawed and washed with 500 ul of PBS with centrifuged at 1,000 rpm for 10 minutes at room temperature. The supernatant was discarded, and the nucleated cells were re-suspended in PBS.
PBS with 5% BSA was used for droplet preparation in culture dishes. We used mouth-controlled pipettes to transfer and dilute the nucleated cells suspension in droplets under a microscope, and isolated a single nucleated cell into 200 ul tubes containing 1.5 ul of alkaline lysis buffer. The tubes with single cells were frozen in −80°C until further processing.
For blastocyst biopsies, the embryos expanded trophectoderm (TE) protruding through the opened zona on day 5 or 6 were chosen. Three to eight herniating TE cells were aspirated into the lumen of a pipette (internal diameter: 30 µm) and detached from the blastocyst by firing a laser at the area of constriction. They were then washed several times in wash buffer and preserved in 200 ul PCR tubes.
Data generation and basic process
The single cells were amplified with the GenomePlex Single Cell Whole Genome Amplification Kit (Sigma Aldrich) according to the manufacturer's instructions. The cell WGA was quality controlled by PCR with primers for housekeeping genes including PRDX6, RPL37a, ADD1, PSMD7, and ATP5O.
In total, two YH cells (YH-1 & YH-2), one T21 cell (SC 1), one CDC cell (SC 2), two micro-duplication cells (SC 6 & SC 7) and three single-cells from blastocysts (SC 3, SC 4, SC 5) were qualified for further process in this study. After WGA, the products were used to prepare a library of 350 bp insert size, and received whole genome sequencing in Hiseq2000 platform with single-end (SE) 50 bp. All the raw sequencing data had submitted to NCBI SRA (http://www.ncbi.nlm.nih.gov/sra
) and the Submission ID is SRA060638.
In the basic data process, reads after WGA-adapter removal were mapped to the reference human genome (Hg18, Build36) using SOAP2 
with maximum two mismatches [−v 2]. Afterwards, we removed low quality alignments, such as PCR duplications (i.e. the identical reads) and non-unique alignments.
SNP array and Whole genome sequencing (WGS) analysis
To evaluate the performance of our method, we used two different methods as gold standard, SNP array for single-cell samples and WGS for peripheral blood (PB) samples.
Van et al. reported the SNP array analysis can identify CNVs as small as 150 kb (>5 SNPs) for single cell 
. Therefore, it is an appropriate gold standard to compare the accuracy of detecting CNV larger than 1-Mb in this study. For three single-cell samples (sample SC3 to SC5), SNP array analysis was conducted according to the manufacturer's instructions. Each DNA sample was hybridized to the Gene Chip Mapping Nsp I 262K microarray (Affymetrix Inc., Santa Clara, CA). Approximately 260,000 SNP signal intensities for each test samples were compared computationally with averaged signals from 30 previously evaluated normal female reference samples. Copy number analysis was performed by Gene Chip Genotyping Analysis Software (GTYPE) with default parameters to minimize the noise hybrid signals.
Derek et al. showed that WGS method can achieve a sensitivity of approximately 100% when detected CNVs larger than 200-kb using ~10M reads 
. Thus, we employed WGS methods as gold standards for PB samples to detect CNV larger than 1-Mb. The rest six PB samples, including YH and the individuals with Down's syndrome, Cri-du-chat syndrome and two individuals with micro duplication syndrome, were used for DNA extraction and whole genome sequencing analysis with 350 bp insert size library and single-end 50 bp sequencing in Illumina HiSeq2000 platform, generating average 10M reads for each sample. The sequencing data was mapped into the reference genome (Hg18, Build 36.3) and non-unique mapped reads and PCR duplication were removed. Then, we recruited these two YH samples as controls to perform copy number analysis for the other four samples by SegSeq (parameters: -W 400 –a 1000 –b 10) 
To lessen uncertain factors caused by data volume and characteristics of the human reference genome, such as GC content and uniqueness, we performed an optimized dynamic observation windows selection. First, the average size of the windows was determined by considering the GC content that is characteristic of the human reference genome and our low coverage sequencing strategy. Since it had been reported that long DNA segments (>300 kb) have relatively homogeneous GC content 
, the average size should be less than 300 kb to access vivid GC content distribution for observation or correction. However, the low coverage sequencing strategy (about 10 million reads per sample) would lay more uncertain influence in smaller genomic regions. Therefore, we decided to divide the human genome into non-overlapping observation windows with an average size of about 150 kb. Second, after considering uniqueness, we performed a simulation to obtain the dynamic observation windows. We divided the reference genome (HG18, Build 36) into sliding SE50 simulated reads and mapped them to the genome with a maximum of two mismatches and reserved unique mapped reads. When constructing these windows, we allowed the windows to share the same simulated reads numbers instead of the same window size to achieve higher comparability of suitable expected sequencing reads number among windows. Finally, we got 18,743 observation windows, each sharing 140,000 simulated reads (Figure S3
WGA GC bias description and correction
To describe the GC bias in WGA, we defined several statistics, including the average sequencing GC content (
), average reference GC content (
) and unique mapped reads (
) in each window. Subscript
represent the different windows and samples, respectively. The windows with no reads and zero GC percent are ignored. Let
represents the relative reads number (RRN), which is calculated by
is the global average number of sequencing reads in each window on autosomes. The effect of GC bias on the RRN was defined as
, which implied the average deviation between the observed RRN to its expected, where
is calculated by the following formula:
represents the optimal prediction, which was obtained via a loess regression fit of the relative reads number (RRN) against the GC content, rounded to the nearest 0.5% increment. 
Based on this discovery, we developed a weighted correction strategy in the following way. The sequencing reads within the window
with GC content (
) were assigned with a weight
is the average number of sequencing reads in each window which is calculated for every 1% GC content in sequencing data (
) and every 1% GC content in reference genome (
). Then, the corrected reads number (
) in each window was achieved with the following formula,
. Afterwards, we defined the correction relative reads number (CRN,
is the global average CRN) as the normalized statistics for the following analysis.
Binary segmentation algorithm for CNVs breakpoints identification
After removing the WGA-induced bias, we developed a binary segmentation algorithm to detect CNVs breakpoints with low coverage massively parallel sequencing. The binary segmentation algorithm consisted of initialization and iterative merging between adjacent segments.
We calculated the significance of differences between the two sides of each window with a run-test
. Moreover, 100 CRNs were employed in the left and right side of each window for difference significance statistics and each window was assigned a p
-value. Then, the 3,000 windows with minimal p
-value were selected as ordered initialized candidate breakpoints (
Iterative merging between adjacent segments
was associated with a left segment from
, and a right segment from
. We estimated the difference between the left and right segment of each window with a run-test
, where the p
-value was denoted as
. The candidate breakpoint with the most insignificant difference (the largest p
-value) would be removed from
, indicating that segments of the two sides of this breakpoint were merged. The iteration calculation was performed until the p
-value of each breakpoint was less than the final p
-value cutoff (
), where we choose the final p
-value cutoff (
) from a control by the methods described below.
Dynamic threshold determination for final signal filtering
To minimize the false signals and misdiagnosis of CNVs, we defined a cutoff threshold for the average CRN between two breakpoints after segmentation. Since the regions with same sequencing GC content had similar variation trends in WGA, based on central-limit theorem, we calculated the lower and upper quantile (alpha
0.05) of CRNs with the same sequencing GC content as the deletion and duplication cutoff threshold respectively.
Evaluation of sensitivity and specificity for test samples
To assess the efficacy of this method for CNVs detection, we calculated the sensitivity and specificity through seven test samples in the following way:
represents the total length of CNVs detected by this method,
represents the length of true CNVs detected by this method, and
represents the length of CNVs confirmed by SNP-array or whole genome sequencing analysis.
Simulations and breakpoint precision analysis in silico
To get the overall sensitivity and specificity of our method on call level, we performed a simulation on YH genome in silico
. (I) Candidate genomics region selection.
We first filtered the N regions on YH genome. Then we randomly selected regions as candidate regions for further CNV simulation. To mimic the influence of YH “true CNVs” in further simulation, we analyzed these candidate regions using WGS data of YH PB with SegSeq and excluded those regions overlapped with “true CNVs” on YH genome. The remained candidate regions would be used to CNVs simulation. (II) CNV simulation.
We then randomly simulated CNVs ranging from 500 kb to 5 Mb based the candidate regions above on YH-1 and YH-2. We simulated totally 100 CNVs for each size. Under the consideration of sequence volume, we extract 10M sequence reads from YH-1 and YH-2 respectively. RRNs on selected regions would multiply by 0.5 (deletion) or 1.5 (duplication) directly. Standard GC correction, binary segmentation and dynamic threshold determination were performed to detect these simulated CNVs. (III) Sensitivity/specificity statistics and breakpoint precision analysis.
The sensitivity and specificity were calculated by the following formula:
represent the number of simulated CNVs (100), true signals of CNVs and false signals of CNVs, respectively. Also, we calculated the minimum distance between the predicted and simulated CNV breakpoints to reveal the breakpoint precision of our method.