To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and F344 (Fischer) are two inbred rat strains with differences in several phenotypes, including susceptibility to autoimmune disease models and inflammatory responses. While these strains have been extensively studied, little information is available about the DA and F344 genomes, as only the Brown Norway (BN) and spontaneously hypertensive rat strains have been sequenced to date. Here we report the sequencing of the DA and F344 genomes using next-generation Illumina paired-end read technology and the first de novo assembly of a rat genome. DA and F344 were sequenced with an average depth of 32-fold, covered 98.9% of the BN reference genome, and included 97.97% of known rat ESTs. New sequences could be assigned to 59 million positions with previously unknown data in the BN reference genome. Differences between DA, F344, and BN included 19 million positions in novel scaffolds, 4.09 million single nucleotide polymorphisms (SNPs) (including 1.37 million new SNPs), 458,224 short insertions and deletions, and 58,174 structural variants. Genetic differences between DA, F344, and BN, including high-impact SNPs and short insertions and deletions affecting >2500 genes, are likely to account for most of the phenotypic variation between these strains. The new DA and F344 genome sequencing data should facilitate gene discovery efforts in rat models of human disease.
BN; DA; F344; Rattus norvegicus; whole-genome sequencing; next-generation whole-genome sequencing (NGS)
In this study, progressive autosomal dominant hearing loss in a five-generation Chinese family (family F013) was mapped by linkage analysis to a critical region spanning 28.54 Mb on chromosome 9q31.3-q34.3. This is a novel DFNA locus, designated DFNA56. The interval contains 398 annotated genes. Whole exome sequencing was then applied to three patients and one unaffected individual from this family. Six single nucleotide variants and two indels were found to co-segregate with the phenotype. Using mass spectrometry (Sequenom, Inc.) to assess the eight sites in 53 subjects of F013, we found that only the variant in the TNC gene co-segregated with hearing loss. This missense mutation (c.5317G>A, p.V1773M) of TNC lies within the critical linked interval. Further screening of the coding region of this gene in 587 subjects with nonsyndromic hearing loss (NSHL) found a second missense mutation, c.5368A>T (p.T1796S), co-segregating with the phenotype in another family. Both mutations are located in a conserved region of TNC and were absent in 387 normal-hearing individuals of matched geographical ancestry. SIFT predicted both mutations to be deleterious. Together, these results support TNC as a candidate causal gene for the hearing loss inherited in these families. TNC encodes tenascin-C, an extracellular matrix (ECM) protein that is present in the basilar membrane (BM) and the osseous spiral lamina of the cochlea. It plays an important role in cochlear development. Up-regulated expression of TNC has been observed during tissue repair and neural regeneration in humans and zebrafish, and during sensory receptor recovery in the vestibular organ after ototoxic injury in birds. The absence of normal tenascin-C is therefore hypothesized to cause irreversible injury to the cochlea and, consequently, hearing loss.
The major histocompatibility complex (MHC) is one of the most variable and gene-dense regions of the human genome. Most studies of the MHC, and associated regions, focus on minor variants and HLA typing, many of which have been demonstrated to be associated with human disease susceptibility and metabolic pathways. However, the detection of variants in the MHC region, and diagnostic HLA typing, still lacks a coherent, standardized, cost-effective and high-coverage protocol of clinical quality and reliability. In this paper, we present such a method for the accurate detection of minor variants and HLA types in the human MHC region, using high-throughput, high-coverage sequencing of target regions. A probe set was designed based on the 8 annotated human MHC haplotypes to encompass the 5 megabases (Mb) of the extended MHC region. We evaluated the probe set on three genetically diverse human samples, and the sequencing data show that ∼97% of the MHC region, and over 99% of the genes in the MHC region, are covered with sufficient depth and good evenness. 98% of the genotypes called by this capture sequencing are consistent with established HapMap genotypes. We have concurrently developed a one-step pipeline for calling any HLA type referenced in the IMGT/HLA database from this target capture sequencing data, which shows over 96% typing accuracy at 4-digit resolution. This cost-effective and highly accurate approach for variant detection and HLA typing in the MHC region may lend further insight into studies of immune-mediated diseases, and may find clinical utility in transplantation medicine research. This one-step pipeline is released for general evaluation and use by the scientific community.
Genome-wide association studies have mainly relied on common HapMap sequence variations. Recently, sequencing approaches have allowed analysis of low frequency and rare variants in conjunction with common variants, thereby improving the search for functional variants and thus the understanding of the underlying biology of human traits and diseases. Here, we used a large Icelandic whole genome sequence dataset combined with Danish exome sequence data to gain insight into the genetic architecture of serum levels of vitamin B12 (B12) and folate. Up to 22.9 million sequence variants were analyzed in combined samples of 45,576 and 37,341 individuals with serum B12 and folate measurements, respectively. We found six novel loci associating with serum B12 (CD320, TCN2, ABCD4, MMAA, MMACHC) or folate levels (FOLR3) and confirmed seven loci for these traits (TCN1, FUT6, FUT2, CUBN, CLYBL, MUT, MTHFR). Conditional analyses established that four loci contain additional independent signals. Interestingly, 13 of the 18 identified variants were coding and 11 of the 13 target genes have known functions related to B12 and folate pathways. Contrary to epidemiological studies we did not find consistent association of the variants with cardiovascular diseases, cancers or Alzheimer's disease although some variants demonstrated pleiotropic effects. Although to some degree impeded by low statistical power for some of these conditions, these data suggest that sequence variants that contribute to the population diversity in serum B12 or folate levels do not modify the risk of developing these conditions. Yet, the study demonstrates the value of combining whole genome and exome sequencing approaches to ascertain the genetic and molecular architectures underlying quantitative trait associations.
Genome-wide association studies have in recent years revealed a wealth of common variants associated with common diseases and phenotypes. We took advantage of the advances in sequencing technologies to study the association of low frequency and rare variants in conjunction with common variants with serum levels of vitamin B12 (B12) and folate in Icelanders and Danes. We found 18 independent signals in 13 loci associated with serum B12 or folate levels. Interestingly, 13 of the 18 identified variants are coding and 11 of the 13 target genes have known functions related to B12 and folate pathways. These data indicate that the target genes at all of the loci have been identified. Epidemiological studies have shown a relationship between serum B12 and folate levels and the risk of cardiovascular diseases, cancers, and Alzheimer's disease. We investigated association between the identified variants and these diseases but did not find consistent association.
To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, by leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and the GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads of different lengths. Unlike its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Evaluation on real human genome data demonstrates SOAP3-dp's power to enable more authentic variants and longer indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.
The applications of massively parallel sequencing technology to fetal cell-free DNA (cff-DNA) have brought new insight to non-invasive prenatal diagnosis. However, most previous research based on maternal plasma sequencing has been restricted to fetal aneuploidies. To detect specific parentally inherited mutations, invasive approaches to obtain fetal DNA are the current standard in the clinic because of the experimental complexity and resource consumption of previously reported non-invasive approaches.
Here, we present a simple and effective non-invasive method for accurate fetal genome recovery assisted by parental haplotypes. The parental haplotypes were first inferred using a combined strategy of trio and unrelated individuals. Assisted by the parental haplotypes, we then employed a hidden Markov model to non-invasively recover the fetal genome through maternal plasma sequencing.
Using a sequencing depth of approximately 44X at a cff-DNA concentration of approximately 5.69%, we non-invasively inferred fetal genotype and haplotype under different situations of parental heterozygosity. Our data show that 98.57%, 95.37%, and 98.45% of paternal autosome alleles, maternal autosome alleles, and maternal chromosome X in the fetal haplotypes, respectively, were recovered accurately. Additionally, we obtained efficient coverage or strong linkage of 96.65% of reported Mendelian-disorder genes and 98.90% of complex disease-associated markers.
Our method provides a useful strategy for non-invasive whole fetal genome recovery.
Copy number variations (CNVs), a common type of genomic mutation associated with various diseases, are important in research and clinical applications. Whole genome amplification (WGA) and massively parallel sequencing have been applied to single-cell CNV analysis, which provides new insight for the fields of biology and medicine. However, WGA-induced bias significantly limits the sensitivity and specificity of CNV detection. To address these limitations, we developed a practical bioinformatic methodology for CNV detection at the single-cell level using low-coverage massively parallel sequencing. The method consists of GC correction to remove WGA-induced bias, a binary segmentation algorithm to locate CNV breakpoints, and dynamic threshold determination for final signal filtering. We evaluated our method with seven test samples using low-coverage sequencing (4∼9.5%). Four were single-cell samples from peripheral blood, whose karyotypes were confirmed by whole genome sequencing analysis. The other three test samples were derived from blastocysts, whose karyotypes were confirmed by SNP-array analysis. The detection results for CNVs larger than 1 Mb were highly consistent with the confirmed results, reaching 99.63% sensitivity and 97.71% specificity at the base-pair level. Our study demonstrates the potential to overcome WGA bias and to detect CNVs (>1 Mb) at the single-cell level through low-coverage massively parallel sequencing. It highlights the potential for CNV research on single cells or limited DNA samples and may prove a promising tool for research and clinical applications, such as pre-implantation genetic diagnosis/screening, fetal nucleated red blood cell research and cancer heterogeneity analysis.
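The GC correction step mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the genome has been divided into fixed bins with a read count and a GC fraction each, and it uses a common median-ratio scheme (scaling each bin by the global median count over the median count of bins with similar GC content). The function name `gc_correct` and the parameter `stratum_width` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, stratum_width=0.05):
    """Rescale per-bin read counts so that bins in every GC stratum
    share the same median count (the global median).

    counts        -- read count per genomic bin
    gc_fractions  -- GC fraction (0..1) of each bin, same order as counts
    stratum_width -- width of a GC stratum (illustrative default)
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")

    global_median = median(counts)

    # Group the counts of bins that fall into the same GC stratum.
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[int(gc / stratum_width)].append(count)
    stratum_median = {key: median(values) for key, values in strata.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = stratum_median[int(gc / stratum_width)]
        # A stratum with a zero median carries no usable signal.
        corrected.append(count * global_median / m if m > 0 else 0.0)
    return corrected


# Toy data: GC-rich bins are over-amplified two-fold.
toy_counts = [100, 100, 200, 200]
toy_gc = [0.31, 0.32, 0.51, 0.52]
toy_corrected = gc_correct(toy_counts, toy_gc)
```

In this toy input the two GC strata differ only by amplification bias, so the corrected counts become equal across bins. Segmentation and thresholding would then operate on the corrected values.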
There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.
To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.
Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
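For readers unfamiliar with the N50 statistic quoted above: N50 is the length L such that contigs (or scaffolds) of length at least L together account for at least half of the total assembled length. A minimal Python sketch of that calculation follows. The function name and the toy lengths are illustrative and are not taken from SOAPdenovo2.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly: total length 32, so the threshold is 16.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Because the two longest toy contigs (8 + 8) already reach half of the total length, the N50 of this toy set is 8.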
Genome; Assembly; Contig; Scaffold; Error correction; Gap-filling
Insertion and deletion polymorphisms (indels) are an important source of genomic variation in plant and animal genomes, but accurate genotyping from low-coverage and exome next-generation sequence data remains challenging. We introduce an efficient population clustering algorithm for diploids and polyploids which was tested on a dataset of 2000 exomes. Compared with existing methods, we report a 4-fold reduction in overall indel genotype error rates with a 9-fold reduction in low coverage regions.
Conventional prenatal screening tests, such as maternal serum tests and ultrasound scan, have limited resolution and accuracy.
We developed an advanced noninvasive prenatal diagnosis method based on massively parallel sequencing. The Noninvasive Fetal Trisomy (NIFTY) test combines an optimized Student’s t-test with a locally weighted polynomial regression and binary hypotheses. We applied the NIFTY test to 903 pregnancies and compared the diagnostic results with those of full karyotyping.
16 of 16 trisomy 21, 12 of 12 trisomy 18, two of two trisomy 13, three of four 45, X, one of one XYY and two of two XXY abnormalities were correctly identified. However, one false positive case of trisomy 18 and one false negative case of 45, X were observed. The test performed with 100% sensitivity and 99.9% specificity for autosomal aneuploidies and 85.7% sensitivity and 99.9% specificity for sex chromosomal aneuploidies. Compared with three previously reported z-score approaches with/without GC-bias removal and with internal control, the NIFTY test was more accurate and robust for the detection of both autosomal and sex chromosomal aneuploidies in fetuses.
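The reported sensitivities follow directly from the counts listed above. The short Python sketch below reproduces that arithmetic. The helper name `sensitivity` is illustrative, and the grouping of cases into autosomal and sex chromosomal classes is read from the abstract rather than from the full paper.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of affected cases that the test correctly identified."""
    return true_positives / (true_positives + false_negatives)


# Autosomal aneuploidies: trisomy 21 (16/16), trisomy 18 (12/12), trisomy 13 (2/2).
autosomal_tp = 16 + 12 + 2
autosomal_fn = 0

# Sex chromosomal aneuploidies: 45,X (3/4), XYY (1/1), XXY (2/2).
sex_tp = 3 + 1 + 2
sex_fn = 1

autosomal_sensitivity_pct = round(100 * sensitivity(autosomal_tp, autosomal_fn), 1)
sex_sensitivity_pct = round(100 * sensitivity(sex_tp, sex_fn), 1)
```

This gives 30 of 30 (100%) for autosomal aneuploidies and 6 of 7 (85.7%) for sex chromosomal aneuploidies, matching the figures stated in the text.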
Our study demonstrates a powerful and reliable methodology for noninvasive prenatal diagnosis.
Noninvasive Fetal Trisomy (NIFTY) test; Massively parallel sequencing; Autosomal aneuploidies; Sex chromosomal aneuploidies
It is evident that epigenetic factors, especially DNA methylation, play essential roles in obesity development. Using the pig as a model, we investigated the systematic association between DNA methylation and obesity. We sampled eight different adipose tissues and two distinct skeletal muscle tissues from three pig breeds living within comparable environments but displaying distinct fat levels. We generated 1,381 gigabases (Gb) of sequence data from 180 methylated DNA immunoprecipitation (MeDIP) libraries, and provided a genome-wide DNA methylation map as well as a gene expression map for adipose and muscle studies. The analysis showed global similarities and differences among breeds, sexes and anatomic locations, and identified the differentially methylated regions (DMRs). The DMRs in promoters are highly associated with obesity development via expression repression of both known obesity-related genes and novel genes. This comprehensive map provides a solid basis for exploring epigenetic mechanisms of adipose deposition and muscle growth.
The pig is an economically important food source, amounting to approximately 40% of all meat consumed worldwide. Pigs also serve as an important model organism because of their similarity to humans at the anatomical, physiological and genetic level, making them very useful for studying a variety of human diseases. A pig strain of particular interest is the miniature pig, specifically the Wuzhishan pig (WZSP), as it has been extensively inbred. Its high level of homozygosity offers increased ease for selective breeding for specific traits and a more straightforward understanding of the genetic changes that underlie its biological characteristics. WZSP also serves as a promising means for applications in surgery, tissue engineering, and xenotransplantation. Here, we report the sequencing and analysis of an inbred WZSP genome.
Our results reveal some unique genomic features, including a relatively high level of homozygosity in the diploid genome, an unusual distribution of heterozygosity, an over-representation of tRNA-derived transposable elements, a small amount of porcine endogenous retrovirus, and a lack of type C retroviruses. In addition, we carried out systematic research on gene evolution, together with a detailed investigation of the counterparts of human drug target genes.
Our results provide the opportunity to more clearly define the genomic character of pig, which could enhance our ability to create more useful pig models.
Wuzhishan pig; Genome; Homozygosis; Transposable element; Endogenous retrovirus; Animal model
Motivation: Despite the prevalence of copy number variation (CNV) in the human genome, only a handful of confirmed associations have been reported between common CNVs and complex disease. This may be partially attributed to the difficulty in accurately genotyping CNVs in large cohorts using array-based technologies. Exome sequencing is now widely being applied to case–control cohorts and presents an exciting opportunity to look for common CNVs associated with disease.
Results: We developed ExoCNVTest: an exome sequencing analysis pipeline to identify disease-associated CNVs and to generate absolute copy number genotypes at putatively associated loci. Our method re-discovered the LCE3B_LCE3C CNV association with psoriasis (P-value = 5 × 10e−6) while controlling inflation of test statistics (λ < 1). ExoCNVTest-derived absolute CNV genotypes were 97.4% concordant with PCR-derived genotypes at this locus.
Availability and implementation: ExoCNVTest has been implemented in Java and R and is freely available from www1.imperial.ac.uk/medicine/people/l.coin/.
email@example.com or Lachlan.J.M.Coin@genomics.org.cn
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
Cancers arise through an evolutionary process in which cell populations are subjected to selection; however, to date, the process of bladder cancer, which is one of the most common cancers in the world, remains unknown at a single-cell level.
We carried out single-cell exome sequencing of 66 individual tumor cells from a muscle-invasive bladder transitional cell carcinoma (TCC). Analyses of the somatic mutant allele frequency spectrum and clonal structure revealed that the tumor cells were derived from a single ancestral cell, but that subsequent evolution occurred, leading to two distinct tumor cell subpopulations. By analyzing recurrently mutant genes in an additional cohort of 99 TCC tumors, we identified genes that might play roles in the maintenance of the ancestral clone and in the muscle-invasive capability of subclones of this bladder cancer, respectively.
This work provides a new approach for investigating the genetic details of bladder tumoral changes at the single-cell level and a new method for assessing bladder cancer evolution at a cell-population level.
Single-cell exome sequencing; Bladder cancer; Tumor evolution; Population genetics
We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency in a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.
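To make the dynamic programming idea concrete, the Python sketch below computes, for a single site, the likelihood of the data given each possible number of derived alleles in a sample of diploid individuals. It uses a widely used recursion over individuals; whether it matches the authors' exact algorithm is an assumption, and the function name and input layout (one triple of genotype likelihoods per individual) are hypothetical.

```python
from math import comb


def sample_af_likelihoods(genotype_likelihoods):
    """Return h, where h[k] = P(data | k derived alleles in the sample).

    genotype_likelihoods -- one (L0, L1, L2) tuple per diploid individual,
    with Lg = P(reads of that individual | genotype carrying g derived alleles).
    """
    # z[j] accumulates the likelihood of j derived alleles among the
    # individuals processed so far, before the combinatorial normalisation.
    z = [1.0]
    for l0, l1, l2 in genotype_likelihoods:
        updated = [0.0] * (len(z) + 2)
        for j, value in enumerate(z):
            updated[j] += value * l0            # individual adds 0 derived alleles
            updated[j + 1] += value * 2.0 * l1  # heterozygote: two phase orderings
            updated[j + 2] += value * l2        # individual adds 2 derived alleles
        z = updated

    chromosomes = len(z) - 1  # equals 2 * number of individuals
    return [z[k] / comb(chromosomes, k) for k in range(chromosomes + 1)]


# Toy site: three individuals whose genotypes are known with certainty (0, 1, 2).
toy_h = sample_af_likelihoods([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
```

The recursion costs time quadratic in the number of individuals per site, instead of summing over all 3^n genotype configurations. In the toy input only k = 3 is compatible with the data, so every other entry of the result is zero.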
A major question in evolutionary biology is how natural selection has shaped patterns of genetic variation across the human genome. Previous work has documented a reduction in genetic diversity in regions of the genome with low recombination rates. However, it is unclear whether other summaries of genetic variation, like allele frequencies, are also correlated with recombination rate and whether these correlations can be explained solely by negative selection against deleterious mutations or whether positive selection acting on favorable alleles is also required. Here we attempt to address these questions by analyzing three different genome-wide resequencing datasets from European individuals. We document several significant correlations between different genomic features. In particular, we find that average minor allele frequency and diversity are reduced in regions of low recombination and that human diversity, human-chimp divergence, and average minor allele frequency are reduced near genes. Population genetic simulations show that either positive natural selection acting on favorable mutations or negative natural selection acting against deleterious mutations can explain these correlations. However, models with strong positive selection on nonsynonymous mutations and little negative selection predict a stronger negative correlation between neutral diversity and nonsynonymous divergence than observed in the actual data, supporting the importance of negative, rather than positive, selection throughout the genome. Further, we show that the widespread presence of weakly deleterious alleles, rather than a small number of strongly positively selected mutations, is responsible for the correlation between neutral genetic diversity and recombination rate. This work suggests that natural selection has affected multiple aspects of linked neutral variation throughout the human genome and that positive selection is not required to explain these observations.
While researchers have identified candidate genes that have evolved under positive Darwinian natural selection, less is known about how much of the human genome has been affected by natural selection or whether positive selection has had a greater role at shaping patterns of variation across the human genome than negative selection acting against deleterious mutations. To address these questions, we have combined patterns of genetic variation in three genome-wide resequencing datasets with population genetic models of natural selection. We find that genetic diversity and average minor allele frequency are reduced in regions of the genome with low recombination rate. Additionally, genetic diversity, human-chimp divergence, and average minor allele frequency have been reduced near genes. Overall, while we cannot exclude positive selection at a fraction of mutations, models that include many weakly deleterious mutations throughout the human genome better explain multiple aspects of the genome-wide resequencing data. This work points to negative selection as an important force for shaping patterns of variation and suggests that there are many weakly deleterious mutations at both coding and noncoding sites throughout the human genome. Understanding such mutations will be important for learning about human evolution and the genetic basis of common disease.
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost effective approach is to use medium or low-coverage data (e.g., < 15X). However, SNP calling and allele frequency estimation in such studies is associated with substantial statistical uncertainty because of varying coverage and high error rates.
We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data.
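The idea of integrating over genotype uncertainty instead of calling genotypes first can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it assumes diploid individuals, Hardy-Weinberg proportions, and per-individual genotype likelihoods as input, and it uses an expectation-maximisation (EM) iteration as the optimiser, which may differ from the optimiser used in the paper. All names are hypothetical.

```python
def estimate_allele_frequency(genotype_likelihoods, max_iterations=200, tolerance=1e-10):
    """Maximum likelihood allele frequency from genotype likelihoods via EM.

    genotype_likelihoods -- one (L0, L1, L2) tuple per diploid individual,
    with Lg = P(reads | genotype carrying g copies of the allele).
    Assumes Hardy-Weinberg genotype proportions (an assumption of this sketch).
    """
    if not genotype_likelihoods:
        raise ValueError("at least one individual is required")

    f = 0.5  # starting value
    for _ in range(max_iterations):
        expected_copies = 0.0
        for l0, l1, l2 in genotype_likelihoods:
            # Posterior weight of each genotype given the current frequency.
            p0 = l0 * (1.0 - f) ** 2
            p1 = l1 * 2.0 * f * (1.0 - f)
            p2 = l2 * f * f
            norm = p0 + p1 + p2
            if norm <= 0.0:
                raise ValueError("genotype likelihoods incompatible with current estimate")
            expected_copies += (p1 + 2.0 * p2) / norm
        new_f = expected_copies / (2.0 * len(genotype_likelihoods))
        if abs(new_f - f) < tolerance:
            return new_f
        f = new_f
    return f


# Toy data: one confident reference homozygote and one confident heterozygote.
toy_frequency = estimate_allele_frequency([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

When genotypes are certain, as in the toy input, the estimate reduces to simple allele counting (1 copy out of 4 chromosomes). With low-coverage data the likelihoods are spread across genotypes, and each individual contributes a fractional expected allele count rather than a hard call.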
Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.
Myopia is the most common ocular disorder worldwide, and high myopia in particular is one of the leading causes of blindness. Genetic factors play a critical role in the development of myopia, especially high myopia. Recently, the exome sequencing approach has been successfully used for the disease gene identification of Mendelian disorders. Here we show a successful application of exome sequencing to an autosomal dominant disorder, identifying a gene potentially responsible for high myopia in a monogenic form. We captured exomes of two affected individuals from a Han Chinese family with high myopia and performed sequencing analysis by a second-generation sequencer with a mean coverage of 30× and sufficient depth to call variants at ∼97% of each targeted exome. The shared genetic variants of these two affected individuals in the family being studied were filtered against the 1000 Genomes Project and the dbSNP131 database. A mutation A672G in zinc finger protein 644 isoform 1 (ZNF644) was identified as being related to the phenotype of this family. After we performed sequencing analysis of the exons in the ZNF644 gene in 300 sporadic cases of high myopia, we identified an additional five mutations (I587V, R680G, C699Y, 3′UTR+12 C>G, and 3′UTR+592 G>A) in 11 different patients. All these mutations were absent in 600 normal controls. The ZNF644 gene was expressed in human retina and retinal pigment epithelium (RPE). Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, its mutation may cause the axial elongation of the eyeball found in high myopia patients. Our results suggest that ZNF644 might be a causal gene for high myopia in a monogenic form.
People with myopia see near objects more clearly than objects far away. Myopia is the most common ocular disorder worldwide, with a high prevalence in Asian (40%–70%) and Caucasian (20%–30%) populations. Although the etiologies of myopia have not yet been established, previous studies have indicated the involvement of genetic and environmental factors (such as close working habits, higher education levels, and higher socioeconomic class). Genetic factors play a critical role in the development of myopia, especially high myopia. In this study, we use exome sequencing, a powerful tool for disease gene identification, to identify a gene involved in high myopia in a monogenic form among Han Chinese. Mutations in zinc finger protein 644 isoform 1 (ZNF644) were identified as potentially responsible for the phenotype of high myopia. The main feature of high myopia is axial elongation of the eye globe. Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, a mutant ZNF644 protein may affect normal eye development and therefore may underlie the axial elongation of the eye globe in high myopia patients. Further study of the biological function of ZNF644 will provide insight into the pathogenesis of myopia.
Analysis across the genome of patterns of DNA methylation reveals a rich landscape of allele-specific epigenetic modification and consequent effects on allele-specific gene expression.
DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies.
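As a rough illustration of how a figure such as "68.4% of CpG sites methylated" can be derived from bisulfite read counts, the Python sketch below computes a per-site methylation level and the fraction of sites called methylated. The fixed thresholds (`min_level`, `min_depth`) and the toy counts are assumptions made for this example; the abstract does not state how sites were called, and published analyses typically use a statistical test rather than a hard cutoff.

```python
from typing import Iterable, Optional, Tuple


def methylation_level(methylated: int, unmethylated: int) -> Optional[float]:
    """Fraction of reads supporting methylation at one cytosine.

    After bisulfite conversion, reads showing C support methylation and
    reads showing T support no methylation. Returns None with no coverage.
    """
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


def fraction_methylated_sites(
    sites: Iterable[Tuple[int, int]],
    min_level: float = 0.5,  # assumed cutoff, not from the source
    min_depth: int = 4,      # assumed minimum coverage, not from the source
) -> float:
    """Fraction of sufficiently covered sites whose level reaches min_level."""
    called = 0
    methylated_sites = 0
    for meth, unmeth in sites:
        if meth + unmeth < min_depth:
            continue  # skip sites with too little coverage to call
        called += 1
        if meth / (meth + unmeth) >= min_level:
            methylated_sites += 1
    return methylated_sites / called if called else 0.0


# Toy (methylated, unmethylated) read counts for four CpG sites.
toy_sites = [(9, 1), (0, 10), (6, 2), (1, 1)]
frac = fraction_methylated_sites(toy_sites)
print(f"{frac:.3f}")
```

Here the last site is skipped for low depth, and two of the three remaining sites pass the assumed cutoff.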
Epigenetic modifications such as addition of methyl groups to cytosine in DNA play a role in regulating gene expression. To better understand these processes, knowledge of the methylation status of all cytosine bases in the genome (the methylome) is required. DNA methylation can differ between the two gene copies (alleles) in each cell. Such allele-specific methylation (ASM) can be due to parental origin of the alleles (imprinting), X chromosome inactivation in females, and other as yet unknown mechanisms. This may significantly alter the expression profile arising from different allele combinations in different individuals. Using advanced sequencing technology, we have determined the methylome of human peripheral blood mononuclear cells (PBMC). Importantly, the PBMC were obtained from the same male Han Chinese individual whose complete genome had previously been determined. This allowed us, for the first time, to study genome-wide differences in ASM. Our analysis shows that ASM in PBMC is higher than can be accounted for by regions known to undergo parent-of-origin imprinting and frequently (>80%) correlates with allele-specific expression (ASE) of the corresponding gene. In addition, our data reveal a rich landscape of epigenomic variation for 20 genomic features, including regulatory, coding, and non-coding sequences, and provide a valuable resource for future studies. Our work further establishes whole-genome sequencing as an efficient method for methylome analysis.
Recent studies in human genomes have demonstrated the use of de novo assemblies to identify genetic variations that are difficult to detect with mapping-based approaches. Construction of multiple human genome assemblies is enabled by massively parallel sequencing, but a conventional bioinformatics solution is costly and slow, creating bottlenecks in the process. This review describes two public short-read de novo assembly applications that can handle human genomes, ABySS and SOAPdenovo. It also discusses the technical aspects and future challenges of human genome de novo assembly by short reads.
de novo assembly; de Bruijn graph; massively parallel sequencing
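To make the de Bruijn graph keyword concrete, here is a minimal Python sketch of the general idea: reads are split into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by following unambiguous edges. The function names and toy reads are hypothetical. This is a teaching simplification, not how ABySS or SOAPdenovo are implemented; it ignores sequencing errors, reverse complements, in-degree checks, and memory constraints.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break  # stop on a cycle instead of looping forever
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Toy overlapping reads, invented for illustration.
reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ACG")
print(contig)
```

With these three reads and k = 4, the walk from `ACG` recovers the sequence `ACGTCAG`, which spans all of the reads.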
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.