author:("Li, yingui")
1.  BASE: a practical de novo assembler for large genomes using long NGS reads 
BMC Genomics  2016;17(Suppl 5):499.
De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on a de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads.
This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs.
Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed when dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. For the 250 bp dataset in particular, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors.
BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.
PMCID: PMC5009518  PMID: 27586129
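The BASE abstract above compares assemblers by N50. As a minimal sketch of how that statistic is commonly computed (the function name `n50` and the toy contig lengths are illustrative assumptions, not part of BASE or SOAPdenovo2): sort the contig lengths in descending order and report the length at which the running total first reaches half of all assembled bases.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in bp; total is 54, so half is 27.
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contigs, which is why the abstract treats it as a contig-quality measure. It says nothing about correctness on its own.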
2.  Design of Association Studies with Pooled or Un-pooled Next-Generation Sequencing Data 
Genetic epidemiology  2010;34(5):479-491.
Most common hereditary diseases in humans are complex and multifactorial. Large-scale genome-wide association studies based on SNP genotyping have identified only a small fraction of the heritable variation of these diseases. One explanation may be that many rare variants (minor allele frequency, MAF < 5%), which are not included in the common genotyping platforms, may contribute substantially to the genetic variation of these diseases. Next-generation sequencing, which would allow the analysis of rare variants, is now becoming so cheap that it provides a viable alternative to SNP genotyping. In this paper, we present cost-effective protocols for using next-generation sequencing in association mapping studies based on pooled and un-pooled samples, and identify optimal designs with respect to total number of individuals, number of individuals per pool, and the sequencing coverage. We perform a small empirical study to evaluate the pooling variance in a realistic setting where pooling is combined with exon capturing. To test for associations, we develop a likelihood ratio statistic that accounts for the high error rate of next-generation sequencing data. We also perform extensive simulations to determine the power and accuracy of this method. Overall, our findings suggest that, at a fixed cost, sequencing many individuals at a shallower depth with a larger pool size achieves higher power than sequencing a small number of individuals at a higher depth with a smaller pool size, even in the presence of high error rates. Our results provide guidelines for researchers who are developing association mapping studies based on next-generation sequencing.
PMCID: PMC5001557  PMID: 20552648
3.  Genome-wide characteristics of de novo mutations in autism 
NPJ genomic medicine  2016;1:16027-1-16027-10.
De novo mutations (DNMs) are important in Autism Spectrum Disorder (ASD), but so far analyses have mainly been on the ~1.5% of the genome encoding genes. Here, we performed whole genome sequencing (WGS) of 200 ASD parent-child trios and characterized germline and somatic DNMs. We confirmed that the majority of germline DNMs (75.6%) originated from the father, and these increased significantly with paternal age only (p=4.2×10^−10). However, when clustered DNMs (those within 20kb) were found in ASD, not only did they mostly originate from the mother (p=7.7×10^−13), but they could also be found adjacent to de novo copy number variations (CNVs) where the mutation rate was significantly elevated (p=2.4×10^−24). By comparing DNMs detected in controls, we found a significant enrichment of predicted damaging DNMs in ASD cases (p=8.0×10^−9; OR=1.84), of which 15.6% (p=4.3×10^−3) and 22.5% (p=7.0×10^−5) were in the non-coding or genic non-coding, respectively. The non-coding elements most enriched for DNM were untranslated regions of genes, boundaries involved in exon-skipping and DNase I hypersensitive regions. Using microarrays and a novel outlier detection test, we also found aberrant methylation profiles in 2/185 (1.1%) of ASD cases. These same individuals carried independently identified DNMs in the ASD risk- and epigenetic- genes DNMT3A and ADNP. Our data begin to characterize different genome-wide DNMs, and highlight the contribution of non-coding variants, to the etiology of ASD.
PMCID: PMC4980121  PMID: 27525107 CAMSID: cams5778
4.  The Interactions of Aquaporins and Mineral Nutrients in Higher Plants 
Aquaporins, major intrinsic proteins (MIPs) present in the plasma and intracellular membranes, facilitate the transport of small neutral molecules across cell membranes in higher plants. Recently, progress has been made in understanding the mechanisms of aquaporin subcellular localization, transport selectivity, and gating properties. Although the role of aquaporins in maintaining the plant water status has been addressed, the interactions between plant aquaporins and mineral nutrients remain largely unknown. This review highlights the roles of various aquaporin orthologues in mineral nutrient uptake and transport, as well as the regulatory effects of mineral nutrients on aquaporin expression and activity, and identifies an integrated link between aquaporins and mineral nutrient metabolism.
PMCID: PMC5000627  PMID: 27483251
aquaporin; water transport; membrane protein; mineral nutrient
5.  Analysis of a four generation family reveals the widespread sequence-dependent maintenance of allelic DNA methylation in somatic and germ cells 
Scientific Reports  2016;6:19260.
Differential methylation of the homologous chromosomes, a well-known mechanism leading to genomic imprinting and X-chromosome inactivation, is widely reported at the non-imprinted regions on autosomes. To evaluate the transgenerational DNA methylation patterns in humans, we analyzed the DNA methylomes of somatic and germ cells in a four-generation family. We found that allelic asymmetry of DNA methylation was pervasive at the non-imprinted loci and was likely regulated by cis-acting genetic variants. We also observed that the allelic methylation patterns for the vast majority of the cis-regulated loci were shared between the somatic and germ cells from the same individual. These results demonstrated the interaction between genetic and epigenetic variations and suggested the possibility of widespread sequence-dependent transmission of DNA methylation during spermatogenesis.
PMCID: PMC4713049  PMID: 26758766
6.  Root ABA Accumulation Enhances Rice Seedling Drought Tolerance under Ammonium Supply: Interaction with Aquaporins 
In previous studies, we demonstrated that ammonium nutrition enhances the drought tolerance of rice seedlings compared to nitrate nutrition and contributes to a higher root water uptake ability. It remains unclear why rice seedlings maintain a higher water uptake ability when supplied with ammonium under drought stress. Here, we focused on the effects of nitrogen form and drought stress on root abscisic acid (ABA) concentration and aquaporin expression using hydroponics experiments and simulating drought stress with 10% PEG6000. Drought stress decreased the leaf photosynthetic rate and stomatal conductivity and increased the leaf temperature of plants supplied with either ammonium or nitrate, but especially under nitrate supply. After 4 h of PEG treatment, the root protoplast water permeability and the expression of root PIP and TIP genes decreased in plants supplied with ammonium or nitrate. After 24 h of PEG treatment, the root hydraulic conductivity, the protoplast water permeability, and the expression of some aquaporin genes increased in plants supplied with ammonium compared to those under non-PEG treatment. Root ABA accumulation was induced by 24 h of PEG treatment, especially in plants supplied with ammonium. The addition of exogenous ABA decreased the expression of PIP and TIP genes under non-PEG treatment but increased the expression of some of them under PEG treatment. We concluded that drought stress induced a down-regulation of aquaporin expression, which appeared earlier than did root ABA accumulation. With continued drought stress, aquaporin expression and activity increased due to root ABA accumulation in plants supplied with ammonium.
PMCID: PMC4979525  PMID: 27559341
rice; water uptake; ABA; aquaporin; drought stress
7.  Gekko japonicus genome reveals evolution of adhesive toe pads and tail regeneration 
Nature Communications  2015;6:10033.
Reptiles are the most morphologically and physiologically diverse tetrapods, and have undergone 300 million years of adaptive evolution. Within the reptilian tetrapods, geckos possess several interesting features, including the ability to regenerate autotomized tails and to climb on smooth surfaces. Here we sequence the genome of Gekko japonicus (Schlegel's Japanese Gecko) and investigate genetic elements related to its physiology. We obtain a draft G. japonicus genome sequence of 2.55 Gb and annotate 22,487 genes. Comparative genomic analysis reveals specific gene family expansions or reductions that are associated with the formation of adhesive setae, nocturnal vision and tail regeneration, as well as the diversification of olfactory sensation. The obtained genomic data provide robust genetic evidence of adaptive evolution in reptiles.
Geckos are small, agile reptiles with nocturnal habits. Here, the authors sequence the genome of the Schlegel's Japanese Gecko and reveal gene family expansions and reductions associated with formation of adhesive setae, nocturnal vision, tail regeneration, and diversification of olfactory sensation.
PMCID: PMC4673495  PMID: 26598231
8.  Full-length single-cell RNA-seq applied to a viral human cancer: applications to HPV expression and splicing analysis in HeLa S3 cells 
GigaScience  2015;4:51.
Viral infection causes multiple forms of human cancer, and HPV infection is the primary factor in cervical carcinomas. Recent single-cell RNA-seq studies highlight the tumor heterogeneity present in most cancers, but virally induced tumors have not been studied. HeLa is a well characterized HPV+ cervical cancer cell line.
We developed a new high throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip. Using this method, we successfully amplified full-length transcripts of 669 single HeLa S3 cells and 40 of them were randomly selected to perform single-cell RNA sequencing. Based on these data, we obtained a comprehensive understanding of the heterogeneity of HeLa S3 cells in gene expression, alternative splicing and fusions. Furthermore, we identified a high diversity of HPV-18 expression and splicing at the single-cell level. By co-expression analysis we identified 283 E6, E7 co-regulated genes, including CDC25, PCNA, PLK4, BUB1B and IRF1 known to interact with HPV viral proteins.
Our results reveal the heterogeneity of a virus-infected cell line. It not only provides a transcriptome characterization of HeLa S3 cells at the single cell level, but is a demonstration of the power of single cell RNA-seq analysis of virally infected cells and cancers.
Electronic supplementary material
The online version of this article (doi:10.1186/s13742-015-0091-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4635585  PMID: 26550473
Single-cell transcriptome; HeLa; HPV; Virus; Tumor heterogeneity; Cancer; RNA splicing
9.  Whole Exome Sequencing Identifies Frequent Somatic Mutations in Cell-Cell Adhesion Genes in Chinese Patients with Lung Squamous Cell Carcinoma 
Scientific Reports  2015;5:14237.
Lung squamous cell carcinoma (SQCC) accounts for about 30% of all lung cancer cases. Understanding of the mutational landscape for this subtype of lung cancer in Chinese patients is currently limited. We performed whole exome sequencing in samples from 100 patients with lung SQCCs to search for somatic mutations, followed by target capture sequencing in another 98 samples for validation. We identified 20 significantly mutated genes, including TP53, CDH10, NFE2L2 and PTEN. Pathways with frequently mutated genes included those of cell-cell adhesion/Wnt/Hippo in 76%, oxidative stress response in 21%, and phosphatidylinositol-3-OH kinase in 36% of the tested tumor samples. Mutations of chromatin regulatory factor genes were identified at a lower frequency. In functional assays, we observed that knockdown of CDH10 promoted cell proliferation, soft-agar colony formation, cell migration and cell invasion, and overexpression of CDH10 inhibited cell proliferation. This mutational landscape of lung SQCC in Chinese patients improves our current understanding of lung carcinogenesis, early diagnosis and personalized therapy.
PMCID: PMC4621504  PMID: 26503331
10.  CCR-14-0330R1: Concurrent Alterations in TERT, KDM6A, and the BRCA Pathway in Bladder Cancer 
Genetic analysis of bladder cancer (BC) has revealed a number of frequently altered genes, including frequent alterations of the telomerase (TERT) gene promoter; though few altered genes have been functionally evaluated. Our objective is to characterize alterations observed by exome sequencing and sequencing of the TERT promoter, and to examine the functional relevance of KDM6A, a frequently mutated histone demethylase, in BC.
Experimental Design
We analyzed BC samples from 54 U.S. patients by exome and targeted sequencing and confirmed somatic variants using normal tissue from the same patient. We examined the biological function of KDM6A using in vivo and in vitro assays.
We observed frequent somatic alterations in BAP1 in 15% of tumors, including deleterious alterations to the deubiquitinase active site and the nuclear localization signal. BAP1 mutations contribute to a high frequency of tumors with BRCA pathway alterations and were significantly associated with papillary histologic features in tumors. BAP1 and KDM6A mutations significantly co-occurred in tumors. Somatic variants altering the TERT promoter were found in 69% of tumors but were not correlated with alterations in other BC genes. We examined the function of KDM6A, altered in 24% of tumors, and show depletion in human BC cells enhanced in vitro proliferation, in vivo tumor growth, and cell migration.
This study is the first to identify frequent BAP1 and BRCA pathway alterations in BC, show TERT promoter alterations are independent of other BC gene alterations, and show KDM6A loss is a driver of the BC phenotype.
PMCID: PMC4166537  PMID: 25225064
bladder neoplasms; mutation; biomarker; survival analysis; next generation sequencing; tumor suppressor; chromatin remodeler; TERT; KDM6A; UTX; BAP1; BRCA
11.  Comparison of variations detection between whole-genome amplification methods used in single-cell resequencing 
GigaScience  2015;4:37.
Single-cell resequencing (SCRS) provides many biomedical advances in variations detection at the single-cell level, but it currently relies on whole genome amplification (WGA). Three methods are commonly used for WGA: multiple displacement amplification (MDA), degenerate-oligonucleotide-primed PCR (DOP-PCR) and multiple annealing and looping-based amplification cycles (MALBAC). However, a comprehensive comparison of variations detection performance between these WGA methods has not yet been performed.
We systematically compared the advantages and disadvantages of different WGA methods, focusing particularly on variations detection. Low-coverage whole-genome sequencing revealed that DOP-PCR had the highest duplication ratio, but an even read distribution and the best reproducibility and accuracy for detection of copy-number variations (CNVs). However, MDA had significantly higher genome recovery sensitivity (~84 %) than DOP-PCR (~6 %) and MALBAC (~52 %) at high sequencing depth. MALBAC and MDA had comparable single-nucleotide variations detection efficiency, false-positive ratio, and allele drop-out ratio. We further demonstrated that SCRS data amplified by either MDA or MALBAC from a gastric cancer cell line could accurately detect gastric cancer CNVs with comparable sensitivity and specificity, including amplifications of 12p11.22 (KRAS) and 9p24.1 (JAK2, CD274, and PDCD1LG2).
Our findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods. It will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.
Electronic supplementary material
The online version of this article (doi:10.1186/s13742-015-0068-3) contains supplementary material, which is available to authorized users.
PMCID: PMC4527218  PMID: 26251698
Whole genome amplification; Single-cell resequencing; Variations detection; DOP-PCR; MDA; MALBAC; Next-generation sequencing
13.  Targeted gene correction minimally impacts whole-genome mutational load in human disease-specific induced pluripotent stem cell clones 
Cell stem cell  2014;15(1):31-36.
The utility of genome editing technologies for disease modeling and developing cellular therapies has been extensively documented, but the impact of these technologies on mutational load at the whole-genome level remains unclear. We performed whole-genome sequencing to evaluate the mutational load at single-base resolution in individual gene-corrected human induced pluripotent stem cell (hiPSC) clones in three different disease models. Single-cell clones that underwent gene correction by helper-dependent adenoviral vector (HDAdV) or Transcription Activator-Like Effector Nuclease (TALEN) exhibited few off-target effects and a low level of sequence variation, comparable to that accumulated in routine hiPSC culture. The sequence variants were randomly distributed and unique to individual clones. We also combined both technologies and developed a TALEN-HDAdV hybrid vector, which significantly increased gene-correction efficiency in hiPSCs. Therefore, with careful monitoring via whole genome sequencing it is possible to apply genome editing to human pluripotent cells with minimal impact on genomic mutational load.
PMCID: PMC4144407  PMID: 24996168
14.  Discovery of biclonal origin and a novel oncogene SLC12A5 in colon cancer by single-cell sequencing 
Cell Research  2014;24(6):701-712.
Single-cell sequencing is a powerful tool for delineating clonal relationships and identifying key driver genes for personalized cancer management. Here we performed single-cell sequencing analysis of a case of colon cancer. Population genetics analyses identified two independent clones in the tumor cell population. The major tumor clone harbored APC and TP53 mutations as early oncogenic events, whereas the minor clone contained preponderant CDC27 and PABPC1 mutations. The absence of APC and TP53 mutations in the minor clone supports that these two clones were derived from two cellular origins. Examination of somatic mutation allele frequency spectra of an additional 21 whole-tissue exome-sequenced cases revealed the heterogeneity of clonal origins in colon cancer. Next, we identified a mutated gene SLC12A5 that showed a high frequency of mutation at the single-cell level but exhibited low prevalence at the population level. Functional characterization of mutant SLC12A5 revealed its potential oncogenic effect in colon cancer. Our study provides the first exome-wide evidence at single-cell level supporting that colon cancer could be of a biclonal origin, and suggests that low-prevalence mutations in a cohort may also play important protumorigenic roles at the individual level.
PMCID: PMC4042168  PMID: 24699064
single-cell sequencing; colon cancer; SLC12A5; biclonal; oncogene
15.  Molecular Signatures of Major Depression 
Current Biology  2015;25(9):1146-1156.
Adversity, particularly in early life, can cause illness. Clues to the responsible mechanisms may lie with the discovery of molecular signatures of stress, some of which include alterations to an individual’s somatic genome. Here, using genome sequences from 11,670 women, we observed a highly significant association between a stress-related disease, major depression, and the amount of mtDNA (p = 9.00 × 10^−42, odds ratio 1.33 [95% confidence interval [CI] = 1.29–1.37]) and telomere length (p = 2.84 × 10^−14, odds ratio 0.85 [95% CI = 0.81–0.89]). While both telomere length and mtDNA amount were associated with adverse life events, conditional regression analyses showed the molecular changes were contingent on the depressed state. We tested this hypothesis with experiments in mice, demonstrating that stress causes both molecular changes, which are partly reversible and can be elicited by the administration of corticosterone. Together, these results demonstrate that changes in the amount of mtDNA and telomere length are consequences of stress and entering a depressed state. These findings identify increased amounts of mtDNA as a molecular marker of MD and have important implications for understanding how stress causes the disease.
• Amount of mtDNA is increased, and telomeric DNA is shortened in major depression
• Both changes can be induced with stress but are contingent on the depressed state
• Changes are tissue specific and in part due to glucocorticoid secretion
• Changes are in part reversible and represent switches in metabolic strategy
Cai et al. found increases in mtDNA and a reduction in telomeric DNA in cases of major depression using whole-genome sequencing. Both changes are depression state dependent. Mice exposed to chronic stress or glucocorticoids showed that these changes reflect switches in metabolic strategy and are tissue specific and partially reversible.
PMCID: PMC4425463  PMID: 25913401
16.  MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC) 
BMC Bioinformatics  2015;16(Suppl 7):S10.
Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers like Tianhe-2, currently top of the TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts, as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores).
To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM).
MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development.
Availability and implementation
MICA's source code is freely available at under GPL v3.
Supplementary information
Supplementary information is available as "Additional File 1". Datasets are available at
PMCID: PMC4423751  PMID: 25952019
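The MICA figures quoted above (4.9-fold speedup over BWA-MEM with one MIC card, 14.1-fold with three, and 17.47 Tera-bases aligned in an hour on 400 nodes) can be turned into a scaling efficiency and a per-node throughput with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the run took exactly one hour, which the abstract states only approximately.

```python
def scaling_efficiency(speedup_one_card, speedup_n_cards, n_cards):
    """Fraction of ideal linear scaling achieved going from 1 to n cards."""
    if n_cards < 1 or speedup_one_card <= 0:
        raise ValueError("n_cards must be >= 1 and speedups positive")
    return speedup_n_cards / (n_cards * speedup_one_card)


def bases_per_node_hour(total_bases, nodes, hours):
    """Average number of bases aligned by one node in one hour."""
    if nodes <= 0 or hours <= 0:
        raise ValueError("nodes and hours must be positive")
    return total_bases / (nodes * hours)


if __name__ == "__main__":
    # Speedups over BWA-MEM as quoted in the abstract.
    eff = scaling_efficiency(4.9, 14.1, 3)
    print(f"3-card scaling efficiency: {eff:.1%}")

    # 17.47 Tera-bases on 400 nodes; 1.0 hour is an assumption.
    rate = bases_per_node_hour(17.47e12, 400, 1.0)
    print(f"throughput: {rate / 1e9:.1f} Gbases per node-hour")
```

An efficiency close to 1 is consistent with the abstract's claim that MICA scales efficiently across multiple cards in a node.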
17.  Novel recurrently mutated genes and a prognostic mutation signature in colorectal cancer 
Gut  2014;64(4):636-645.
Characterisation of colorectal cancer (CRC) genomes by next-generation sequencing has led to the discovery of novel recurrently mutated genes. Nevertheless, genomic data has not yet been used for CRC prognostication.
To identify recurrent somatic mutations with prognostic significance in patients with CRC.
Exome sequencing was performed to identify somatic mutations in tumour tissues of 22 patients with CRC, followed by validation of 187 recurrent and pathway-related genes using targeted capture sequencing in an additional 160 cases.
Seven significantly mutated genes, including four reported (APC, TP53, KRAS and SMAD4) and three novel recurrently mutated genes (CDH10, FAT4 and DOCK2), exhibited high mutation prevalence (6–14% for novel cancer genes) and higher-than-expected number of non-silent mutations in our CRC cohort. For prognostication, a five-gene-signature (CDH10, COL6A3, SMAD4, TMEM132D, VCAN) was devised, in which mutation(s) in one or more of these genes was significantly associated with better overall survival independent of tumor-node-metastasis (TNM) staging. The median survival time was 80.4 months in the mutant group versus 42.4 months in the wild type group (p=0.0051). The prognostic significance of this signature was successfully verified using the data set from the Cancer Genome Atlas study.
The application of next-generation sequencing has led to the identification of three novel significantly mutated genes in CRC and a mutation signature that predicts survival outcomes for stratifying patients with CRC independent of TNM staging.
PMCID: PMC4392212  PMID: 24951259
18.  Inference of Purifying and Positive Selection in Three Subspecies of Chimpanzees (Pan troglodytes) from Exome Sequencing 
Genome Biology and Evolution  2015;7(4):1122-1132.
We study genome-wide nucleotide diversity in three subspecies of extant chimpanzees using exome capture. After strict filtering, Single Nucleotide Polymorphisms and indels were called and genotyped for greater than 50% of exons at a mean coverage of 35× per individual. Central chimpanzees (Pan troglodytes troglodytes) are the most polymorphic (nucleotide diversity, θw = 0.0023 per site) followed by Eastern (P. t. schweinfurthii) chimpanzees (θw = 0.0016) and Western (P. t. verus) chimpanzees (θw = 0.0008). A demographic scenario of divergence without gene flow fits the patterns of autosomal synonymous nucleotide diversity well except for a signal of recent gene flow from Western into Eastern chimpanzees. The striking contrast in X-linked versus autosomal polymorphism and divergence previously reported in Central chimpanzees is also found in Eastern and Western chimpanzees. We show that the direction of selection statistic exhibits a strong nonmonotonic relationship with the strength of purifying selection S, making it inappropriate for estimating S. We instead use counts in synonymous versus nonsynonymous frequency classes to infer the distribution of S coefficients acting on nonsynonymous mutations in each subspecies. The strength of purifying selection we infer is congruent with the differences in effective sizes of each subspecies: Central chimpanzees are undergoing the strongest purifying selection followed by Eastern and Western chimpanzees. Coding indels show stronger selection against indels changing the reading frame than observed in human populations.
PMCID: PMC4419804  PMID: 25829516
fitness effect; mutation; selection; effective size
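The chimpanzee abstract above reports nucleotide diversity as θw per site. A minimal sketch of the standard Watterson estimator follows, assuming θw = S / (a_n · L), where S is the number of segregating sites, L the number of sites surveyed, and a_n the sum of 1/i for i from 1 to n − 1 over n sampled chromosomes. The function name and the toy numbers are illustrative assumptions and are not taken from the paper's pipeline.

```python
def watterson_theta(segregating_sites, n_sequences, n_sites):
    """Watterson's estimator of theta per site.

    segregating_sites: number of polymorphic sites (S)
    n_sequences: number of sampled chromosomes (n), at least 2
    n_sites: number of sites surveyed (L)
    """
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    # Harmonic-number correction for sample size.
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)


if __name__ == "__main__":
    # Hypothetical example: 11 segregating sites among 4 chromosomes
    # across 1,000 sites.
    print(watterson_theta(11, 4, 1000))
```

Because a_n grows with sample size, the estimator lets subspecies sampled at different depths be compared on the same per-site scale.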
19.  Whole-exome sequencing identifies OR2W3 mutation as a cause of autosomal dominant retinitis pigmentosa 
Scientific Reports  2015;5:9236.
Retinitis pigmentosa (RP), a heterogeneous group of inherited ocular diseases, is a genetic condition that causes retinal degeneration and eventual vision loss. Though some genes have been identified as associated with RP, a large proportion of clinical cases still cannot be explained. Here we report a four-generation Chinese family with RP, in which 6 of 9 members of the second generation were affected by the disease. To identify the genetic defect in this family, whole-exome sequencing together with validation analysis by Sanger sequencing was performed to find possible pathogenic mutations. After a pipeline of database filtering, including public databases and in-house databases, a novel missense mutation, c. 424 C > T transition (p.R142W) in the OR2W3 gene, was identified as a potentially causative mutation for autosomal dominant RP. The mutation co-segregated with the disease phenotype over four generations. This mutation was validated in another independent three-generation family. RT-PCR analysis also showed that the OR2W3 gene was expressed in the HESC-RPE cell line. The results will not only enhance our current understanding of the genetic basis of RP, but also provide helpful clues for designing future studies to further investigate genetic factors for familial RP.
PMCID: PMC4363838  PMID: 25783483
20.  Excess of Rare Variants in Genes that are Key Epigenetic Regulators of Spermatogenesis in the Patients with Non-Obstructive Azoospermia 
Scientific Reports  2015;5:8785.
Non-obstructive azoospermia (NOA), a severe form of male infertility, is often suspected to be linked to currently undefined genetic abnormalities. To explore the genetic basis of this condition, we successfully sequenced ~650 infertility-related genes in 757 NOA patients and 709 fertile males. We evaluated the contributions of rare variants to the etiology of NOA by identifying individual genes showing nominal associations and testing the genetic burden of a given biological process as a whole. We found a significant excess of rare, non-silent variants in genes that are key epigenetic regulators of spermatogenesis, such as BRWD1, DNMT1, DNMT3B, RNF17, UBR2, USP1 and USP26, in NOA patients (P = 5.5 × 10^−7), corresponding to a carrier frequency of 22.5% of patients and 13.7% of controls (P = 1.4 × 10^−5). An accumulation of low-frequency variants was also identified in additional epigenetic genes (BRDT and MTHFR). Our study suggested the potential associations of genetic defects in genes that are epigenetic regulators with spermatogenic failure in humans.
PMCID: PMC4350091  PMID: 25739334
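The NOA abstract above gives carrier frequencies of 22.5% in patients and 13.7% in controls. As a worked illustration of how such frequencies translate into an effect size, the sketch below computes a crude odds ratio from two proportions. This is not a figure reported in the abstract; the function name is hypothetical and the calculation ignores any covariate adjustment the authors may have applied.

```python
def odds_ratio(freq_cases, freq_controls):
    """Unadjusted odds ratio from carrier frequencies in cases and controls."""
    for freq in (freq_cases, freq_controls):
        if not 0.0 < freq < 1.0:
            raise ValueError("frequencies must lie strictly between 0 and 1")
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


if __name__ == "__main__":
    # Carrier frequencies quoted in the abstract.
    print(f"crude odds ratio: {odds_ratio(0.225, 0.137):.2f}")
```

A value above 1 indicates that carriers are over-represented among patients, which matches the direction of the excess described in the abstract.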
Cancer cells derived from different stages of tumor progression may exhibit distinct biological properties, as exemplified by the paired lung cancer cell lines H1993 and H2073. While H1993 was derived from a chemo-naive metastasized tumor, H2073 originated from the chemo-resistant primary tumor from the same patient and exhibits a strikingly different drug response profile. To understand the underlying genetic and epigenetic bases for their biological properties, we investigated these cells using a wide range of large-scale methods including whole genome sequencing, RNA sequencing, SNP array, DNA methylation array, and de novo genome assembly. We conducted an integrative analysis of both cell lines to distinguish between potential driver and passenger alterations. Although many genes are mutated in these cell lines, the combination of DNA- and RNA-based variant information strongly implicates a small number of genes including TP53 and STK11 as likely drivers. Likewise, we found a diverse set of genes differentially expressed between these cell lines, but only a fraction can be attributed to changes in DNA copy number or methylation. This set included the ABC transporter ABCC4, implicated in drug resistance, and the metastasis-associated MET oncogene. While the rich data content allowed us to reduce the space of hypotheses that could explain most of the observed biological properties, we also caution that there is a lack of statistical power and there are inherent limitations in such single-patient case studies.
PMCID: PMC3940063  PMID: 24297535
23.  Diverse modes of genomic alteration in hepatocellular carcinoma 
Genome Biology  2014;15(8):436.
Hepatocellular carcinoma (HCC) is a heterogeneous disease with a high mortality rate. Recent genomic studies have identified TP53, AXIN1, and CTNNB1 as the most frequently mutated genes. Lower frequency mutations have been reported in ARID1A, ARID2 and JAK1. In addition, hepatitis B virus (HBV) integrations into the human genome have been associated with HCC.
Here, we deep-sequence 42 HCC patients with a combination of whole genome, exome and transcriptome sequencing to identify the mutational landscape of HCC using a reasonably large discovery cohort. We find frequent mutations in TP53, CTNNB1 and AXIN1, and rare but likely functional mutations in BAP1 and IDH1. Besides frequent hepatitis B virus integrations at TERT, we identify translocations at the boundaries of TERT. A novel deletion is identified in CTNNB1 in a region that is heavily mutated in multiple cancers. We also find multiple high-allelic frequency mutations in the extracellular matrix protein LAMA2. Lower expression levels of LAMA2 correlate with a proliferative signature, and predict poor survival and a higher chance of cancer recurrence in HCC patients, suggesting an important role of the extracellular matrix and cell adhesion in tumor progression of a subgroup of HCC patients.
The heterogeneous disease of HCC features diverse modes of genomic alteration. In addition to common point mutations, structural variations and methylation changes, there are several virus-associated changes, including gene disruption or activation, formation of chimeric viral-human transcripts, and DNA copy number changes. Such a multitude of genomic events likely contributes to the heterogeneous nature of HCC.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0436-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4189592  PMID: 25159915
24.  The South Asian Genome 
PLoS ONE  2014;9(8):e102645.
The genetic sequence variation of people from the Indian subcontinent, who comprise one-quarter of the world's population, is not well described. We carried out whole genome sequencing of 168 South Asians, along with whole-exome sequencing of 147 South Asians to provide deeper characterisation of coding regions. We identify 12,962,155 autosomal sequence variants, including 2,946,861 new SNPs and 312,738 novel indels. This catalogue of SNPs and indels amongst South Asians provides the first comprehensive map of genetic variation in this major human population, and reveals evidence for selective pressures on genes involved in skin biology, metabolism, infection and immunity. Our results will accelerate the search for the genetic variants underlying susceptibility to disorders such as type-2 diabetes and cardiovascular disease, which are highly prevalent amongst South Asians.
PMCID: PMC4130493  PMID: 25115870
25.  Variation and association to diabetes in 2000 full mtDNA sequences mined from an exome study in a Danish population 
European Journal of Human Genetics  2014;22(8):1040-1045.
In this paper, we mine full mtDNA sequences from an exome capture data set of 2000 Danes, showing that it is possible to get high-quality full-genome sequences of the mitochondrion from this resource. The sample includes 1000 individuals with type 2 diabetes and 1000 controls. We characterise the variation found in the mtDNA sequence in Danes and relate the variation to diabetes risk as well as to several blood phenotypes of the controls, but find no significant associations. We report 2025 polymorphisms, of which 393 have not been reported previously. These 393 variants are both very rare and estimated to be caused by very recent mutations, but individuals with type 2 diabetes do not possess more of these variants. Population genetics analysis using a Bayesian skyline plot shows a recent history of rapid population growth in the Danish population, in accordance with the fact that >40% of variable sites are observed as singletons.
PMCID: PMC4350597  PMID: 24448545
diabetes; mtDNA; population history
