|Home | About | Journals | Submit | Contact Us | Français|
Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (~×80) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.
Next-generation sequencing has become widely utilized in human genetics research compared to previous technologies, in particular for genome-wide association studies (GWAS). The 1000 Genomes Project (1000GP) has distributed a standard pattern of more than 88 million variants, providing for research use a global genetic reference panel [1–3]. The Haplotype Reference Consortium (HRC) has also constructed a distinct human reference panel, consisting of 39 235 157 single nucleotide polymorphisms (SNPs) . However, most of the samples contributed to either the 1000GP or to the HRC have an average sequencing depth of only ×4~8, which makes characterization of low-frequency variants (minor allele frequency [MAF] < 5%), and especially rare variants (MAF < 1%) , difficult. These 2 projects therefore cannot supply a high-resolution spectrum of variations for many human populations, in particular for Han Chinese .
Low-coverage sequencing can be used to generate a high-quality variation set supplemented by imputation, though imputation will perform poorly in correctly calling rare variants, using the current set of typed markers and reference panels [6, 7]. Whole-exome sequencing (WES) is a viable method to elucidate associations between rare variants and human diseases, when these are to be found in coding regions; however, WES cannot characterize non-coding regions, which encompass 98% of the human genome and are increasingly recognized to play an important role in some human traits [8, 9]. The 1000GP, the HRC and other human genome projects have generated extensive human variation catalogues, which can be used to design high-density genotyping arrays. However, these chips are likely to miss rare or low-frequency alleles . Furthermore, for Han Chinese genomes, a large number of population-specific variants can be expected to be absent from the imputation variant panel set of 1000GP. Therefore, it is necessary to further characterize the Han genome using high-depth whole-genome sequencing.
Besides SNPs and InDels, structural variants (SVs) have also been found in recent studies [10, 11] to be associated with human diseases that contribute to human genetic diversity; 68 818 SVs have already been detected using the data of the 1000GP, and an integrated SV map for global human population has been generated . However, SV identification using short reads from low-coverage sequencing data remains challenging . Assembly-based SV calling based on high-coverage sequencing data provides a feasible and powerful method .
In this study, we have sequenced the genomes of 90 Han Chinese samples extracted from 1000GP, at an average sequencing depth of ~×80, including 7 hitherto un-sequenced samples from 1000GP. We have built a high-resolution spectrum of genetic variation for the Han Chinese population, based on high-coverage genomic data, including SNPs, InDels, and SVs. These data provide a valuable resource for performing further genomics/genetics studies in Han Chinese populations, especially for exploring the functional roles of these low-frequency novel variants. Additionally, as a valuable supplement, the data will enrich genetic variation catalogues of global human populations.
Genomic DNA was extracted from the cell lines of 90 unrelated Chinese samples from the 1000GP, currently deposited at Coriell Institute for Medical Research. Forty-five of these 90 samples were taken from Southern Han Chinese (CHS) individuals, and the remaining 45 were taken from Han Chinese individuals from Beijing (CHB). Of these samples, seven were hitherto unsequenced by the1000GP, including 5 CHB and 2 CHS.
All samples used in this study were obtained from the 1000 Genomes Project cell line collection at the Coriell cell line repository and were qualified by the ethics protocols of the 1000GP. According to the statement of informed consent, all samples held by the 1000GP, including these 90 Chinese individuals, have been released and shared to the public. All individuals have consented to allowing their genomic data being used in the analyses of the project and being freely distributed for future research. Public distribution of the sequencing data and genotypes has also been explicitly consented to. This study has been also approved by the Institutional Review Board on Bioethics and Biosafety (reference number: BGI-IRB 16 101).
Illumina HiSeq 2000 Paired Library preparation was done in accordance with the manufacturer's instructions (Illumina, San Diego, CA, USA). We performed cluster generation using the Illumina cluster station, with workflow as follows: template hybridization, isothermal amplification, linearization, blocking, denaturation, and sequencing primer hybridization. Fluorescent images were processed to sequences using the standard Illumina base-calling pipeline. We built 5 ranks of DNA libraries with different insert sizes (170 bp, 500 bp, 2 kb, 5 kb, 10 kb, 20 kb) (Table 1). The average depth of CHS was 71.87 ± 23.52, and that of CHB was 82.36 ± 14.13. The average genome coverage of CHS was 99.65% ± 0.34%, and that of CHB was 99.60% ± 0.30% (Table 2).
Read alignment was performed using the aln algorithm of BWA 0.6.1 (BWA, RRID:SCR_010910) , using reads with insert sizes ranging from 180 bp to 500 bp, to align against the human genome (hg19/GRCh37). We used Genome Analysis Tool Kit (v. 2.7.1; GATK, RRID:SCR_001876) [16, 17] to remove duplications, realign around InDels, and recalibrate alignment quality scores. SNPs and InDels calling was performed using GATK UnifiedGenotyper. After Variant Quality Score Recalibration filtering, we ultimately obtained 12 568 804 SNPs and 2 074 210 InDels. The step-by-step procedures and command lines of the variant calling procedures have been saved in detail on the protocols.io platform . Of these variants, 12 536 571 SNPs and 1 472 925 InDels were bi-allelic, 4 573 367 were rare variants (MAF ≤ 1%), 3 013 158 were low frequency (1% < MAF ≤ 5%), and 6 422 971 were common (MAF > 5%). Gene-based annotations for SNPs and InDels were also performed using ANNOVAR (ANNOVAR, RRID:SCR_012821)  based on the reference genome of Hg19. We found that 5 651 762 SNPs and 982 480 InDels were distributed in gene regions, respectively, and that more than half of the variants were found in the intergenic regions (Table 3).
To evaluate the SNP set, we first randomly selected 22 samples and performed genotyping using the Illumina Infinium OmniZhongHua-8 v1 BeadChip; 407 040 ± 2635 SNPs were obtained, and the concordance rate was 99.94% ± 0.02% (false discovery rate [FDR] = 0.06% ± 0.02%). We then compared our SNP set with several public genotype data sets, including Affymetrix Affy 6.0, Illumina Omni 2.5 arrays, and the variation set of the 1000GP (phase III). For Affymetrix Affy 6.0 and Illumina Omni 2.5 arrays, we compared the variants of 86 individuals with genotyping overlap between the 2 platforms. As described in the comparison with Illumina Infinium OmniZhongHua-8 v. 1 BeadChip array, FDR rates were both below 0.1% (Table 4). Finally, we ran a comparison between our SNP set and the 1000GP (phase III) SNP set and counted the SNP set overlap between these 2 sets, enumerating the FDR to be 0.32% ± 0.07%. For InDels, we also compared 2 sets using the same method, for which the FDR was 2.72% ± 0.15% (Table 4).
In order to highlight the strengths of deep sequencing and to mine potential novel variants particular to Han Chinese, we compared our variant set to the respective sets of the 1000GP, the Han Chinese subset of the 1000GP, and the dbSNP (build 147). The results showed that a large amount of novel variants were found in all 3 comparisons. In the comparison to the set of the Han Chinese, the 1000GP, 13.17% of the SNPs, and 43.99% of the InDels proved novel (Fig. (Fig.1).1). Among these novel variants, more than 85% of the SNPs and 51% of the InDels can be classified as low frequency or rare, and the proportion of rare variants was larger than that of low-frequency variants. This feature of the variants was particularly visible in the comparison between our set and Han Chinese set of 1000GP, in which the low-frequency SNPs (MAF < 5%) account for 91.7% of the novel set, and in which the rare variants are as high a fraction as 56.8% (Fig. (Fig.22).
In summary, the deep genome sequencing of 90 Han Chinese individuals presents a powerful performance in characterizing novel variants, of which the majority are low frequency or rare. It also lays a foundation to explore the potential functions of these variants in further studies.
We used SOAPdenovo2 (SOAPdenovo2, RRID:SCR_014986)  to assemble the genome of each individual, based on libraries with hierarchical insert size. For the genomic data of each individual, we performed several analysis steps before genome assembly. These steps are as follows: (i) filtering of low-quality reads, correcting of base calling errors; (ii) filtering out reads with adapters (match length ≥10 bp, mismatch ≤ 3); (iii) filtering out reads with fractions of n larger than 10%; (iv) filtering out reads with low-quality base rates in excess of 40%; (v) removal of duplicated reads produced in polymerase chain reaction amplification; (vi) calculation of k-mer frequency of all reads in order to generate frequency tables; and (vii) removal of reads with low-frequency k-mers. Detailed protocol information about de novo assembly has been described and saved on protocols.io .
We finally used reads with insert sizes of less than 2k to assemble the contigs, and we used all reads to assemble scaffolds. The k-mer size of de novo assembly was set as 63, and the merge level was 2. Counting the assembled genomes, the average genome size was 2 951 301 058 ± 12 168 854 bps, the average contigs N50 was 2865 ± 97 bps, and the average contig size was 49 339 ± 6088 bps.
We here applied an integrative strategy to identify the SVs of Han Chinese people, including multiple current algorithms. The assembly-based method (SOAPsv)  was firstly employed in SV calling. We finally obtained a total of 26 142 SVs, containing 12 772 insertions and 13 370 deletions. The average number for each individual is 3102 ± 190. Besides this, several other methods were then applied to call SVs, including Pindel , CNVnator , Breakdancer , and Genome STRiP . More detailed protocol information on SV calling can be found on the protocols.io platform . We then merged all of the deletions in several SV sets according to their breakpoints to obtain an integrative deletion set. These deletions were genotyped in the tool Genome STRiP. We then discarded the SVs with large proportions of un-qualified genotypes (>10%), which included genotypes of low quality (phred-scaled likelihood score lower than 13) and those that were missing. After the above steps, a total of 24 369 deletions passed this filtering. After annotation by RepeatMasker (RepeatMasker, RRID:SCR_012954) , more than 70% of these deletions were found to be in the simple repeat (31.23%), the Alu (23.52%), and the L1 repeat elements (15.66%) (Fig. (Fig.33).
Length distribution showed an enrichment of short deletions (<5 Kb), accounting for 76.77% of all deletions. Only 0.45% of these proved larger than 500 Kb. Based on the criterion of 50% reciprocal overlap , we found that 61.58% of our deletion set was novel with respect to the SVs of the 1000GP; 20% of the novel deletions were low frequency (MAF < 5%). We then evaluated the genotype concordance by comparing with the SV set of the 1000GP. We compared genotypes of the overlapping SVs in both sets. Concordance rates reached 89.45% ± 0.004%. It was as high as 94.90% when the proportion of reciprocal overlap was set at 85% (Fig. (Fig.44).
We built a more comprehensive SV catalogue for Han Chinese people using the above integrative strategy, including assembly-based method, Pindel, CNVnator, Breakdancer, and Genome STRiP. This catalogue harbors significantly more SVs than that of the Han Chinese from 1000GP (~7700), and the majority of them are novel (17 927, 73.56%). The set overlap in SVs between the 2 sets evidences high concordance of genotype. This provides a valuable data panel for use in research about human diversity and genetic diseases.
Project name: WGS of Han Chinese genomes
Project home page: https://github.com/HaoxiangLin/WGS_of_Han_Chinese_genomes
Operating system(s): Linux
Programming language: Shell, Perl, Java, and C++
Other requirements: BWA, 0.6.1; SOAPsv, 1.02; SOAPdenovo2; PINDEL, 0.2.4t; cnvnator, 0.2.7; Breakdancer-max, 1.2; Genome STRIP, v. 1.0; SAMtools, 0.1.18; Picard, 1.61; GATK, 2.7; BEAGLE, v. 3
License: GNU General Public License v.3.0 (GPLv3)
The raw fastq format data were deposited at EBI with the project accession number PRJEB11005, and the secondary accession number ERP012319. VCFs have been archived at EMBL-EBI under accession number PRJEB20820. Datasets further supporting the results of this data note are available in the GigaScience database, GigaDB . Protocols are also available from protocols.io [18, 21, 27].
1000GP: 1000 Genomes Project; CHB: Han Chinese in Beijing; CHS: Southern Han Chinese; FDR: false discovery rate; GATK: Genome Analysis Tool Kit; GWAS: genome-wide association study; HRC: Haplotype Reference Consortium; MAF: minor allele frequency; SV: structural variant.
The authors declare that they have no competing interests.
X.G., X.X., H.Y., J.W., J.W., and X.L. conceived this project. X.G. and H.L. collected the samples, isolated the genomic DNA, and constructed the DNA libraries. T.L., H.L., X.G., W.Z., and M.Y. performed the genome analysis. L.C.A.M.T. provided advice. H.L., T.L., and X.G. helped deposit and curate the datasets in GigaDB. T.L., X.G., and L.C.A.M.T. wrote the article. All authors discussed the project and data. All authors read and approved the final manuscript.
We appreciate the support of the Shenzhen Municipal of Government of China (CXB201108250094A).