author:("Li, yingzi")
Cancer cells derived from different stages of tumor progression may exhibit distinct biological properties, as exemplified by the paired lung cancer cell lines H1993 and H2073. While H1993 was derived from a chemo-naive metastasized tumor, H2073 originated from the chemo-resistant primary tumor from the same patient and exhibits a strikingly different drug response profile. To understand the underlying genetic and epigenetic bases for their biological properties, we investigated these cells using a wide range of large-scale methods including whole genome sequencing, RNA sequencing, SNP array, DNA methylation array, and de novo genome assembly. We conducted an integrative analysis of both cell lines to distinguish between potential driver and passenger alterations. Although many genes are mutated in these cell lines, the combination of DNA- and RNA-based variant information strongly implicates a small number of genes including TP53 and STK11 as likely drivers. Likewise, we found a diverse set of genes differentially expressed between these cell lines, but only a fraction can be attributed to changes in DNA copy number or methylation. This set included the ABC transporter ABCC4, implicated in drug resistance, and the metastasis-associated MET oncogene. While the rich data content allowed us to reduce the space of hypotheses that could explain most of the observed biological properties, we also caution that such single-patient case studies lack statistical power and have inherent limitations.
PMCID: PMC3940063  PMID: 24297535
3.  Diverse modes of genomic alteration in hepatocellular carcinoma 
Genome Biology  2014;15(8):436.
Hepatocellular carcinoma (HCC) is a heterogeneous disease with high mortality rate. Recent genomic studies have identified TP53, AXIN1, and CTNNB1 as the most frequently mutated genes. Lower frequency mutations have been reported in ARID1A, ARID2 and JAK1. In addition, hepatitis B virus (HBV) integrations into the human genome have been associated with HCC.
Here, we deep-sequence 42 HCC patients with a combination of whole genome, exome and transcriptome sequencing to identify the mutational landscape of HCC using a reasonably large discovery cohort. We find frequent mutations in TP53, CTNNB1 and AXIN1, and rare but likely functional mutations in BAP1 and IDH1. Besides frequent hepatitis B virus integrations at TERT, we identify translocations at the boundaries of TERT. A novel deletion is identified in CTNNB1 in a region that is heavily mutated in multiple cancers. We also find multiple high-allelic frequency mutations in the extracellular matrix protein LAMA2. Lower expression levels of LAMA2 correlate with a proliferative signature, and predict poor survival and higher chance of cancer recurrence in HCC patients, suggesting an important role of the extracellular matrix and cell adhesion in tumor progression of a subgroup of HCC patients.
The heterogeneous disease of HCC features diverse modes of genomic alteration. In addition to common point mutations, structural variations and methylation changes, there are several virus-associated changes, including gene disruption or activation, formation of chimeric viral-human transcripts, and DNA copy number changes. Such a multitude of genomic events likely contributes to the heterogeneous nature of HCC.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0436-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4189592  PMID: 25159915
4.  The South Asian Genome 
PLoS ONE  2014;9(8):e102645.
The genetic sequence variation of people from the Indian subcontinent, who comprise one-quarter of the world's population, is not well described. We carried out whole genome sequencing of 168 South Asians, along with whole-exome sequencing of 147 South Asians to provide deeper characterisation of coding regions. We identify 12,962,155 autosomal sequence variants, including 2,946,861 new SNPs and 312,738 novel indels. This catalogue of SNPs and indels amongst South Asians provides the first comprehensive map of genetic variation in this major human population, and reveals evidence for selective pressures on genes involved in skin biology, metabolism, infection and immunity. Our results will accelerate the search for the genetic variants underlying susceptibility to disorders such as type-2 diabetes and cardiovascular disease, which are highly prevalent amongst South Asians.
PMCID: PMC4130493  PMID: 25115870
5.  High-coverage sequencing and annotated assemblies of the budgerigar genome 
GigaScience  2014;3:11.
Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome sequencing and a robust assembly of a parrot genome.
We present a genomic resource for the budgerigar, an Australian Parakeet (Melopsittacus undulatus) -- the most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes over 300× raw read coverage from multiple sequencing technologies and chromosome optical maps from a single male animal. The reads and optical maps were used to create three hybrid assemblies representing some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome sequence assemblies. The sequence reads for this project were in part generated and used for both the Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio single-molecule sequencing.
Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work provides valuable data and material for genome technology development and for investigating the genomics of complex behavioral traits.
PMCID: PMC4109783  PMID: 25061512
Melopsittacus undulatus; Budgerigar; Parakeet; Next-generation sequencing; Hybrid assemblies; Optical maps; Vocal learning
6.  The duck genome and transcriptome provide insight into an avian influenza virus reservoir species 
Nature genetics  2013;45(7):776-783.
The duck (Anas platyrhynchos) is one of the principal natural hosts of influenza A viruses. We present the duck genome sequence and perform deep transcriptome analyses to investigate immune-related genes. Our data indicate that the duck possesses a contractive immune gene repertoire, as in chicken and zebra finch, and this repertoire has been shaped through lineage-specific duplications. We identify genes that are responsive to influenza A viruses using the lung transcriptomes of control ducks and ones that were infected with either a highly pathogenic (A/duck/Hubei/49/05) or a weakly pathogenic (A/goose/Hubei/65/05) H5N1 virus. Further, we show how the duck’s defense mechanisms against influenza infection have been optimized through the diversification of its β-defensin and butyrophilin-like repertoires. These analyses, in combination with the genomic and transcriptomic data, provide a resource for characterizing the interaction between host and influenza viruses.
PMCID: PMC4003391  PMID: 23749191
7.  Whole-Exome Sequencing for the Identification of Susceptibility Genes of Kashin–Beck Disease 
PLoS ONE  2014;9(4):e92298.
This study aimed to identify and investigate the susceptibility genes of Kashin–Beck disease (KBD) in the Chinese population.
Whole-exome capture and sequencing technology was used to detect genetic variations in 19 individuals from six families with a high incidence of KBD. A total of 44 polymorphisms from 41 genes were genotyped in 144 cases and 144 controls by using MassARRAY under the standard protocol from Sequenom. Association analysis was performed on the data by using PLINK 1.07.
In the sequencing stage, each sample showed approximately 70-fold coverage, thus covering more than 99% of the target regions. Among the single nucleotide polymorphisms (SNPs) used in the transmission disequilibrium test, 108 had a p-value of <0.01, whereas 1056 had a p-value of <0.05. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis indicates that these SNPs are concentrated in three major pathways: regulation of actin cytoskeleton, focal adhesion, and metabolic pathways. In the validation stage, single locus effects revealed that two of these polymorphisms (rs7745040 and rs9275295) in the human leukocyte antigen (HLA)-DRB1 gene and one polymorphism (rs9473132) in the CD2-associated protein (CD2AP) gene are significantly associated with KBD.
The HLA-DRB1 and CD2AP genes were identified to be among the susceptibility genes of KBD, thus supporting the role of the autoimmune response in KBD and the possibility of shared etiology between osteoarthritis, rheumatoid arthritis, and KBD.
PMCID: PMC4002427  PMID: 24776925
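The transmission disequilibrium test (TDT) mentioned in the entry above can be illustrated with a minimal sketch. This is the textbook McNemar-style statistic for a single biallelic marker, not the authors' PLINK pipeline; the function name `tdt` and the counts used are hypothetical and chosen only for illustration.

```python
import math


def tdt(transmitted, untransmitted):
    """Textbook TDT for one biallelic marker (illustrative sketch only).

    transmitted   -- times heterozygous parents passed the test allele
                     to an affected child (hypothetical count)
    untransmitted -- times heterozygous parents passed the other allele
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 30 transmissions versus 10 non-transmissions.
chi2, p = tdt(30, 10)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis of no linkage or association, each allele of a heterozygous parent is transmitted with probability 0.5, so a large imbalance between the two counts yields a small p-value.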
8.  Exome capture from saliva produces high quality genomic and metagenomic data 
BMC Genomics  2014;15:262.
Targeted capture of genomic regions reduces sequencing cost while generating higher coverage by allowing biomedical researchers to focus on specific loci of interest, such as exons. Targeted capture also has the potential to facilitate the generation of genomic data from DNA collected via saliva or buccal cells. DNA samples derived from these cell types tend to have a lower human DNA yield, may be degraded from age and/or have contamination from bacteria or other ambient oral microbiota. However, thousands of samples have been previously collected from these cell types, and saliva collection has the advantage that it is non-invasive and appropriate for a wide variety of research.
We demonstrate successful enrichment and sequencing of 15 South African KhoeSan exomes and 2 full genomes with samples initially derived from saliva. The expanded exome dataset enables us to characterize genetic diversity free from ascertainment bias for multiple KhoeSan populations, including new exome data from six HGDP Namibian San, revealing substantial population structure across the Kalahari Desert region. Additionally, we discover and independently verify thirty-one previously unknown KIR alleles using methods we developed to accurately map and call the highly polymorphic HLA and KIR loci from exome capture data. Finally, we show that exome capture of saliva-derived DNA yields sufficient non-human sequences to characterize oral microbial communities, including detection of bacteria linked to oral disease (e.g. Prevotella melaninogenica). For comparison, two samples were sequenced using standard full genome library preparation without exome capture and we found no systematic bias of metagenomic information between exome-captured and non-captured data.
DNA from human saliva samples, collected and extracted using standard procedures, can be used to successfully sequence high quality human exomes, and metagenomic data can be derived from non-human reads. We find that individuals from the Kalahari carry a higher oral pathogenic microbial load than samples surveyed in the Human Microbiome Project. Additionally, rare variants present in the exomes suggest strong population structure across different KhoeSan populations.
PMCID: PMC4051168  PMID: 24708091
Exomes; KhoeSan; Genetic diversity; Metagenomics; Microbiome
9.  Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx) 
Science (New York, N.Y.)  2009;326(5951):433-436.
A single–base pair resolution silkworm genetic variation map was constructed from 40 domesticated and wild silkworms, each sequenced to approximately threefold coverage, representing 99.88% of the genome. We identified ∼16 million single-nucleotide polymorphisms, many indels, and structural variations. We find that the domesticated silkworms are clearly genetically differentiated from the wild ones, but they have maintained large levels of genetic variability, suggesting a short domestication event involving a large number of individuals. We also identified signals of selection at 354 candidate genes that may have been important during domestication, some of which have enriched expression in the silk gland, midgut, and testis. These data add to our understanding of the domestication processes and may have applications in devising pest control strategies and advancing the use of silkworms as efficient bioreactors.
PMCID: PMC3951477  PMID: 19713493
10.  Ancient human genome sequence of an extinct Palaeo-Eskimo 
Nature  2010;463(7282):757-762.
We report here the genome sequence of an ancient human. Obtained from ∼4,000-year-old permafrost-preserved hair, the genome represents a male individual from the first known culture to settle in Greenland. Sequenced to an average depth of 20×, we recover 79% of the diploid genome, an amount close to the practical limit of current sequencing technologies. We identify 353,151 high-confidence single-nucleotide polymorphisms (SNPs), of which 6.8% have not been reported previously. We estimate raw read contamination to be no higher than 0.8%. We use functional SNP assessment to assign possible phenotypic characteristics of the individual that belonged to a culture whose location has yielded only trace human remains. We compare the high-confidence SNPs to those of contemporary populations to find the populations most closely related to the individual. This provides evidence for a migration from Siberia into the New World some 5,500 years ago, independent of that giving rise to the modern Native Americans and Inuit.
PMCID: PMC3951495  PMID: 20148029
11.  The sequence and de novo assembly of the giant panda genome 
Li, Ruiqiang | Fan, Wei | Tian, Geng | Zhu, Hongmei | He, Lin | Cai, Jing | Huang, Quanfei | Cai, Qingle | Li, Bo | Bai, Yinqi | Zhang, Zhihe | Zhang, Yaping | Wang, Wen | Li, Jun | Wei, Fuwen | Li, Heng | Jian, Min | Li, Jianwen | Zhang, Zhaolei | Nielsen, Rasmus | Li, Dawei | Gu, Wanjun | Yang, Zhentao | Xuan, Zhaoling | Ryder, Oliver A. | Leung, Frederick Chi-Ching | Zhou, Yan | Cao, Jianjun | Sun, Xiao | Fu, Yonggui | Fang, Xiaodong | Guo, Xiaosen | Wang, Bo | Hou, Rong | Shen, Fujun | Mu, Bo | Ni, Peixiang | Lin, Runmao | Qian, Wubin | Wang, Guodong | Yu, Chang | Nie, Wenhui | Wang, Jinhuan | Wu, Zhigang | Liang, Huiqing | Min, Jiumeng | Wu, Qi | Cheng, Shifeng | Ruan, Jue | Wang, Mingwei | Shi, Zhongbin | Wen, Ming | Liu, Binghang | Ren, Xiaoli | Zheng, Huisong | Dong, Dong | Cook, Kathleen | Shan, Gao | Zhang, Hao | Kosiol, Carolin | Xie, Xueying | Lu, Zuhong | Zheng, Hancheng | Li, Yingrui | Steiner, Cynthia C. | Lam, Tommy Tsan-Yuk | Lin, Siyuan | Zhang, Qinghui | Li, Guoqing | Tian, Jing | Gong, Timing | Liu, Hongde | Zhang, Dejin | Fang, Lin | Ye, Chen | Zhang, Juanbin | Hu, Wenbo | Xu, Anlong | Ren, Yuanyuan | Zhang, Guojie | Bruford, Michael W. | Li, Qibin | Ma, Lijia | Guo, Yiran | An, Na | Hu, Yujie | Zheng, Yang | Shi, Yongyong | Li, Zhiqiang | Liu, Qing | Chen, Yanling | Zhao, Jing | Qu, Ning | Zhao, Shancen | Tian, Feng | Wang, Xiaoling | Wang, Haiyin | Xu, Lizhi | Liu, Xiao | Vinar, Tomas | Wang, Yajun | Lam, Tak-Wah | Yiu, Siu-Ming | Liu, Shiping | Zhang, Hemin | Li, Desheng | Huang, Yan | Wang, Xia | Yang, Guohua | Jiang, Zhi | Wang, Junyi | Qin, Nan | Li, Li | Li, Jingxiang | Bolund, Lars | Kristiansen, Karsten | Wong, Gane Ka-Shu | Olson, Maynard | Zhang, Xiuqing | Li, Songgang | Yang, Huanming | Wang, Jian | Wang, Jun
Nature  2009;463(7279):311-317.
Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.
PMCID: PMC3951497  PMID: 20010809
12.  Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis 
Genome Biology  2014;15(2):R39.
Sesame, Sesamum indicum L., is considered the queen of oilseeds for its high oil content and quality, and is grown widely in tropical and subtropical areas as an important source of oil and protein. However, the molecular biology of sesame is largely unexplored.
Here, we report a high-quality genome sequence of sesame assembled de novo with a contig N50 of 52.2 kb and a scaffold N50 of 2.1 Mb, containing an estimated 27,148 genes. The results reveal novel, independent whole genome duplication and the absence of the Toll/interleukin-1 receptor domain in resistance genes. Candidate genes and oil biosynthetic pathways contributing to high oil content were discovered by comparative genomic and transcriptomic analyses. These revealed the expansion of type 1 lipid transfer genes by tandem duplication, the contraction of lipid degradation genes, and the differential expression of essential genes in the triacylglycerol biosynthesis pathway, particularly in the early stage of seed development. Resequencing data in 29 sesame accessions from 12 countries suggested that the high genetic diversity of lipid-related genes might be associated with the wide variation in oil content. Additionally, the results shed light on the pivotal stage of seed development, oil accumulation and potential key genes for sesamin production, an important pharmacological constituent of sesame.
As an important species from the order Lamiales and a high oil crop, the sesame genome will facilitate future research on the evolution of eudicots, as well as the study of lipid biosynthesis and potential genetic improvement of sesame.
PMCID: PMC4053841  PMID: 24576357
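The contig and scaffold N50 values quoted in the sesame entry above are a standard assembly-contiguity metric: the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are assumptions made for illustration and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half
        # of the assembly size is the N50.
        if running >= half_total:
            return length


# Toy assembly (hypothetical lengths, arbitrary units); total is 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness; it is usually reported alongside other quality metrics.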
13.  Whole genome sequencing of Ethiopian highlanders reveals conserved hypoxia tolerance genes 
Genome Biology  2014;15(2):R36.
Although it has long been proposed that genetic factors contribute to adaptation to high altitude, such factors remain largely unverified. Recent advances in high-throughput sequencing have made it feasible to analyze genome-wide patterns of genetic variation in human populations. Since traditionally such studies surveyed only a small fraction of the genome, interpretation of the results was limited.
We report here the results of the first whole genome resequencing-based analysis identifying genes that likely modulate high altitude adaptation in native Ethiopians residing at 3,500 m above sea level on Bale Plateau or Chennek field in Ethiopia. Using cross-population tests of selection, we identify regions with a significant loss of diversity, indicative of a selective sweep. We focus on a 208 kbp gene-rich region on chromosome 19, which is significant in both of the Ethiopian subpopulations sampled. This region contains eight protein-coding genes and spans 135 SNPs. To elucidate its potential role in hypoxia tolerance, we experimentally tested whether individual genes from the region affect hypoxia tolerance in Drosophila. Three genes significantly impact survival rates in low oxygen: cic, an ortholog of human CIC, Hsl, an ortholog of human LIPE, and Paf-AHα, an ortholog of human PAFAH1B3.
Our study reveals evolutionarily conserved genes that modulate hypoxia tolerance. In addition, we show that many of our results would likely be unattainable using data from exome sequencing or microarray studies. This highlights the importance of whole genome sequencing for investigating adaptation by natural selection.
PMCID: PMC4054780  PMID: 24555826
15.  The Genome of the Netherlands: design, and project goals 
Within the Netherlands a national network of biobanks has been established (Biobanking and Biomolecular Research Infrastructure-Netherlands (BBMRI-NL)) as a national node of the European BBMRI. One of the aims of BBMRI-NL is to enrich biobanks with different types of molecular and phenotype data. Here, we describe the Genome of the Netherlands (GoNL), one of the projects within BBMRI-NL. GoNL is a whole-genome-sequencing project in a representative sample consisting of 250 trio-families from all provinces in the Netherlands, which aims to characterize DNA sequence variation in the Dutch population. The parent–offspring trios include adult individuals ranging in age from 19 to 87 years (mean=53 years; SD=16 years) from birth cohorts 1910–1994. Sequencing was done on blood-derived DNA from uncultured cells and accomplished coverage was 14–15x. The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants. GoNL will create a catalog of human genetic variation in this sample that is uniquely characterized with respect to micro-geographic location and a wide range of phenotypes. The resource will be made available to the research and medical community to guide the interpretation of sequencing projects. The present paper summarizes the global characteristics of the project.
PMCID: PMC3895638  PMID: 23714750
whole-genome sequence; trio-design; population genetics
16.  Two-Step Source Tracing Strategy of Yersinia pestis and Its Historical Epidemiology in a Specific Region 
PLoS ONE  2014;9(1):e85374.
Source tracing of pathogens is critical for the control and prevention of infectious diseases. Genome sequencing by high-throughput technologies is currently feasible and popular, leading to a surge of deciphered bacterial genome sequences. Using this flood of genomic data for source tracing of pathogens in outbreaks is promising but also challenging. Here, we used Yersinia pestis genomes from a plague outbreak in Xinghai county of China in 2009 as an example to develop a simple two-step strategy for rapid source tracing of the outbreak. The first step was to define the phylogenetic position of the outbreak strains in a whole species tree, and the next step was to provide a detailed relationship across the outbreak strains and their suspected relatives. Through this strategy, we observed that the Xinghai plague outbreak was caused by Y. pestis that circulated in the local plague focus, from which the majority of historical plague epidemics in the Qinghai-Tibet Plateau may have originated. The analytical strategy developed here will be of great help in fighting outbreaks of emerging infectious diseases, by pinpointing the source of pathogens rapidly with genomic epidemiological data and microbial forensics information.
PMCID: PMC3887043  PMID: 24416399
17.  Whole-genome sequencing of matched primary and metastatic hepatocellular carcinomas 
To gain biological insights into lung metastases from hepatocellular carcinoma (HCC), we compared the whole-genome sequencing profiles of primary HCC and paired lung metastases.
We used whole-genome sequencing at 33X-43X coverage to profile somatic mutations in primary HCC (HBV+) and metachronous lung metastases (> 2 years interval).
In total, 5,027-13,961 and 5,275-12,624 somatic single-nucleotide variants (SNVs) were detected in primary HCC and lung metastases, respectively. Generally, 38.88-78.49% of SNVs detected in metastases were present in primary tumors. We identified 65–221 structural variations (SVs) in primary tumors and 60–232 SVs in metastases. Comparison of these SVs shows very similar and largely overlapping mutated segments between primary and metastatic tumors. Copy number alterations between primary and metastatic pairs were also found to be closely related. Together, this preservation of genomic profiles from primary liver tumors to metachronous lung metastases indicates that the genomic features acquired during tumorigenesis may be retained during metastasis.
We found very similar genomic alterations between primary and metastatic tumors, with a few mutations found specifically in lung metastases, which may explain the clinical observation that both primary and metastatic tumors are usually sensitive or resistant to the same systemic treatments.
PMCID: PMC3896667  PMID: 24405831
Cancer; Hepatocellular carcinomas (HCC); Lung metastasis; Somatic; Next-generation sequencing (NGS)
18.  Whole Genome Sequencing in Autism Identifies Hotspots for De Novo Germline Mutation 
Cell  2012;151(7):1431-1442.
De novo mutation plays an important role in Autism Spectrum Disorders (ASDs). Notably, pathogenic copy number variants (CNVs) are characterized by high mutation rates. We hypothesize that hypermutability is a property of ASD genes, and may also include nucleotide-substitution hotspots. We investigated global patterns of germline mutation by whole genome sequencing of monozygotic twins concordant for ASD and their parents. Mutation rates varied widely throughout the genome (by 100-fold) and could be explained by intrinsic characteristics of DNA sequence and chromatin structure. Dense clusters of mutations within individual genomes were attributable to compound mutation or gene conversion. Hypermutability was a characteristic of genes involved in ASD and other diseases. In addition, genes impacted by mutations in this study were associated with ASD in independent exome-sequencing datasets. Our findings suggest that regional hypermutation is a significant factor shaping patterns of genetic variation and disease risk in humans.
PMCID: PMC3712641  PMID: 23260136
19.  YHap: a population model for probabilistic assignment of Y haplogroups from re-sequencing data 
BMC Bioinformatics  2013;14:331.
Y haplogroup analyses are an important component of genealogical reconstruction, population genetic analyses, medical genetics and forensics. These fields are increasingly moving towards use of low-coverage, high throughput sequencing. While there have been methods recently proposed for assignment of Y haplogroups on the basis of high-coverage sequence data, assignment on the basis of low-coverage data remains challenging.
We developed a new algorithm, YHap, which uses an imputation framework to jointly predict Y chromosome genotypes and assign Y haplogroups using low-coverage population sequence data. We use data from the 1000 Genomes Project to demonstrate that YHap provides accurate Y haplogroup assignment with less than 2x coverage.
Borrowing information across multiple samples within a population using an imputation framework enables accurate Y haplogroup assignment.
PMCID: PMC4225519  PMID: 24252171
20.  A human gut microbial gene catalog established by metagenomic sequencing 
Nature  2010;464(7285):59-65.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
PMCID: PMC3779803  PMID: 20203603
21.  Correction: SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner 
PLoS ONE  2013;8(8):10.1371/annotation/823f3670-ed17-41ec-ba51-b50281651915.
PMCID: PMC3750100
22.  Whole-Genome Sequences of DA and F344 Rats with Different Susceptibilities to Arthritis, Autoimmunity, Inflammation and Cancer 
Genetics  2013;194(4):1017-1028.
DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and F344 (Fischer) are two inbred rat strains with differences in several phenotypes, including susceptibility to autoimmune disease models and inflammatory responses. While these strains have been extensively studied, little information is available about the DA and F344 genomes, as only the Brown Norway (BN) and spontaneously hypertensive rat strains have been sequenced to date. Here we report the sequencing of the DA and F344 genomes using next-generation Illumina paired-end read technology and the first de novo assembly of a rat genome. DA and F344 were sequenced with an average depth of 32-fold, covered 98.9% of the BN reference genome, and included 97.97% of known rat ESTs. New sequences could be assigned to 59 million positions with previously unknown data in the BN reference genome. Differences between DA, F344, and BN included 19 million positions in novel scaffolds, 4.09 million single nucleotide polymorphisms (SNPs) (including 1.37 million new SNPs), 458,224 short insertions and deletions, and 58,174 structural variants. Genetic differences between DA, F344, and BN, including high-impact SNPs and short insertions and deletions affecting >2500 genes, are likely to account for most of the phenotypic variation between these strains. The new DA and F344 genome sequencing data should facilitate gene discovery efforts in rat models of human disease.
PMCID: PMC3730908  PMID: 23695301
BN; DA; F344; Rattus norvegicus; whole-genome sequencing; next-generation whole-genome sequencing (NGS)
23.  Exome Sequencing and Linkage Analysis Identified Tenascin-C (TNC) as a Novel Causative Gene in Nonsyndromic Hearing Loss 
PLoS ONE  2013;8(7):e69549.
In this study, a five-generation Chinese family (family F013) with progressive autosomal dominant hearing loss was mapped by linkage analysis to a critical region spanning 28.54 Mb on chromosome 9q31.3-q34.3, a novel DFNA locus designated DFNA56. This interval contained 398 annotated genes. Whole exome sequencing was then applied to three patients and one normal individual from this family. Six single nucleotide variants and two indels were found to co-segregate with the phenotype. Using mass spectrometry (Sequenom, Inc.) to rank the eight sites, we found that only the TNC gene co-segregated with hearing loss in 53 subjects of F013. This missense mutation (c.5317G>A, p.V1773M) of TNC was located exactly in the critical linked interval. Further screening of the coding region of this gene in 587 subjects with nonsyndromic hearing loss (NSHL) found a second missense mutation, c.5368A>T (p.T1796S), co-segregating with the phenotype in another family. These two mutations were located in the conserved region of TNC and were absent in the 387 normal-hearing individuals of matched geographical ancestry. Functional effects of the two mutations were predicted using SIFT, and both mutations were predicted to be deleterious. All these results support TNC as a possible causal gene for the hearing loss inherited in these families. TNC encodes tenascin-C, a component of the extracellular matrix (ECM) that is present in the basilar membrane (BM) and the osseous spiral lamina of the cochlea. It plays an important role in cochlear development. Up-regulated expression of the TNC gene has been seen in tissue repair and neural regeneration in human and zebrafish, and in sensory receptor recovery in the vestibular organ after ototoxic injury in birds. The absence of normal tenascin-C is therefore supposed to cause irreversible injury to the cochlea and, in turn, hearing loss.
PMCID: PMC3728356  PMID: 23936043
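The co-segregation step described in the abstract above (keeping variants carried by the sequenced patients and absent from the unaffected relative) can be illustrated with a minimal sketch. It assumes a fully penetrant dominant model with no phenocopies, and the variant identifiers and the function name `cosegregating_variants` are hypothetical; it is not the authors' pipeline.

```python
def cosegregating_variants(affected, unaffected):
    """Return variants present in every affected individual and in no unaffected one.

    `affected` and `unaffected` are lists of sets of variant identifiers,
    one set per sequenced individual. Assumes full penetrance and no
    phenocopies, which real pedigrees may violate.
    """
    if not affected:
        raise ValueError("need at least one affected individual")
    # Variants shared by all affected individuals.
    shared = set.intersection(*affected)
    # Any variant seen in an unaffected relative is excluded.
    seen_in_unaffected = set().union(*unaffected)
    return shared - seen_in_unaffected


# Hypothetical example: three patients and one unaffected relative.
patients = [{"v1", "v2", "v3"}, {"v1", "v2", "v4"}, {"v1", "v2", "v5"}]
controls = [{"v2", "v6"}]
candidates = cosegregating_variants(patients, controls)
```

In this toy input only `v1` survives the filter, because `v2` is also carried by the unaffected relative.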
24.  An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing 
PLoS ONE  2013;8(7):e69388.
The major histocompatibility complex (MHC) is one of the most variable and gene-dense regions of the human genome. Most studies of the MHC and associated regions focus on minor variants and HLA typing, many of which have been shown to be associated with human disease susceptibility and metabolic pathways. However, the detection of variants in the MHC region, and diagnostic HLA typing, still lack a coherent, standardized, cost-effective and high-coverage protocol of clinical quality and reliability. In this paper, we present such a method for the accurate detection of minor variants and HLA types in the human MHC region, using high-throughput, high-coverage sequencing of target regions. A probe set was designed using the 8 annotated human MHC haplotypes as templates, and to encompass the 5 megabases (Mb) of the extended MHC region. We evaluated the probe set on three genetically diverse human samples, and the sequencing data show that ∼97% of the MHC region, and over 99% of the genes in the MHC region, are covered with sufficient depth and good evenness. 98% of the genotypes called by this capture sequencing are consistent with established HapMap genotypes. We have concurrently developed a one-step pipeline for calling any HLA type referenced in the IMGT/HLA database from this target capture sequencing data, which shows over 96% typing accuracy at 4-digit resolution. This cost-effective and highly accurate approach for variant detection and HLA typing in the MHC region may lend further insight into studies of immune-mediated diseases, and may find clinical utility in transplantation medicine research. The one-step pipeline is released for general evaluation and use by the scientific community.
PMCID: PMC3722289  PMID: 23894464
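The genotype consistency figure quoted in the abstract above is a concordance rate: the fraction of sites genotyped in both call sets where the two genotypes agree. The sketch below shows one simple way to compute such a rate. It assumes unphased genotypes written as two-letter strings, and the site names and the function name `genotype_concordance` are hypothetical; the paper's own comparison procedure may differ.

```python
def genotype_concordance(called, reference):
    """Fraction of shared sites whose unphased genotypes agree.

    `called` and `reference` map a site identifier to a genotype string
    such as "AG". Allele order is ignored, so "AG" matches "GA".
    """
    shared = called.keys() & reference.keys()
    if not shared:
        raise ValueError("no sites in common")
    matches = sum(
        1 for site in shared
        if sorted(called[site]) == sorted(reference[site])
    )
    return matches / len(shared)


# Hypothetical example: two sites are shared, one of them agrees.
calls = {"site1": "AG", "site2": "CC", "site3": "TT"}
truth = {"site1": "GA", "site2": "CT", "site4": "GG"}
rate = genotype_concordance(calls, truth)
```

Sites present in only one of the two sets are left out of the denominator, so the rate describes agreement only where a comparison is possible.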
25.  Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species 
Bradnam, Keith R | Fass, Joseph N | Alexandrov, Anton | Baranay, Paul | Bechner, Michael | Birol, Inanç | Boisvert, Sébastien | Chapman, Jarrod A | Chapuis, Guillaume | Chikhi, Rayan | Chitsaz, Hamidreza | Chou, Wen-Chi | Corbeil, Jacques | Del Fabbro, Cristian | Docking, T Roderick | Durbin, Richard | Earl, Dent | Emrich, Scott | Fedotov, Pavel | Fonseca, Nuno A | Ganapathy, Ganeshkumar | Gibbs, Richard A | Gnerre, Sante | Godzaridis, Élénie | Goldstein, Steve | Haimel, Matthias | Hall, Giles | Haussler, David | Hiatt, Joseph B | Ho, Isaac Y | Howard, Jason | Hunt, Martin | Jackman, Shaun D | Jaffe, David B | Jarvis, Erich D | Jiang, Huaiyang | Kazakov, Sergey | Kersey, Paul J | Kitzman, Jacob O | Knight, James R | Koren, Sergey | Lam, Tak-Wah | Lavenier, Dominique | Laviolette, François | Li, Yingrui | Li, Zhenyu | Liu, Binghang | Liu, Yue | Luo, Ruibang | MacCallum, Iain | MacManes, Matthew D | Maillet, Nicolas | Melnikov, Sergey | Naquin, Delphine | Ning, Zemin | Otto, Thomas D | Paten, Benedict | Paulo, Octávio S | Phillippy, Adam M | Pina-Martins, Francisco | Place, Michael | Przybylski, Dariusz | Qin, Xiang | Qu, Carson | Ribeiro, Filipe J | Richards, Stephen | Rokhsar, Daniel S | Ruby, J Graham | Scalabrin, Simone | Schatz, Michael C | Schwartz, David C | Sergushichev, Alexey | Sharpe, Ted | Shaw, Timothy I | Shendure, Jay | Shi, Yujian | Simpson, Jared T | Song, Henry | Tsarev, Fedor | Vezzi, Francesco | Vicedomini, Riccardo | Vieira, Bruno M | Wang, Jun | Worley, Kim C | Yin, Shuangye | Yiu, Siu-Ming | Yuan, Jianying | Zhang, Guojie | Zhang, Hao | Zhou, Shiguo | Korf, Ian F
GigaScience  2013;2:10.
The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly.
In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies.
Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
PMCID: PMC3844414  PMID: 23870653
Genome assembly; N50; Scaffolds; Assessment; Heterozygosity; COMPASS
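N50, listed among the keywords above, is a widely used contiguity statistic: the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the standard definition follows. The function name `n50` and the example lengths are illustrative, and this is not the Assemblathon evaluation code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length


# Illustrative lengths: total is 54, so half is 27; 10 + 9 + 8 reaches 27.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

N50 rewards long sequences regardless of whether they are correctly assembled, which is one reason an evaluation may combine it with other measures.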
