author:("Li, ruijiang")
1.  Design of Association Studies with Pooled or Un-pooled Next-Generation Sequencing Data 
Genetic epidemiology  2010;34(5):479-491.
Most common hereditary diseases in humans are complex and multifactorial. Large-scale genome-wide association studies based on SNP genotyping have identified only a small fraction of the heritable variation of these diseases. One explanation may be that many rare variants (minor allele frequency, MAF < 5%), which are not included in the common genotyping platforms, may contribute substantially to the genetic variation of these diseases. Next-generation sequencing, which would allow the analysis of rare variants, is now becoming so cheap that it provides a viable alternative to SNP genotyping. In this paper, we present cost-effective protocols for using next-generation sequencing in association mapping studies based on pooled and un-pooled samples, and identify optimal designs with respect to total number of individuals, number of individuals per pool, and the sequencing coverage. We perform a small empirical study to evaluate the pooling variance in a realistic setting where pooling is combined with exon capturing. To test for associations, we develop a likelihood ratio statistic that accounts for the high error rate of next-generation sequencing data. We also perform extensive simulations to determine the power and accuracy of this method. Overall, our findings suggest that with a fixed cost, sequencing many individuals at a shallower depth with a larger pool size achieves higher power than sequencing a small number of individuals at a higher depth with a smaller pool size, even in the presence of high error rates. Our results provide guidelines for researchers who are developing association mapping studies based on next-generation sequencing.
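As a rough illustration of the kind of test the abstract describes, the sketch below computes a likelihood ratio statistic comparing alternate-allele read counts in a case pool and a control pool while allowing for a per-read sequencing error rate. It is not the statistic developed in the paper: the binomial read model, the symmetric error rate `err`, the function names, and the chi-square approximation with one degree of freedom are all assumptions made here for illustration.

```python
import math


def _log_lik(alt_reads, total_reads, q):
    """Binomial log-likelihood of alt_reads out of total_reads (constant term dropped)."""
    return alt_reads * math.log(q) + (total_reads - alt_reads) * math.log(1.0 - q)


def _clip(q, err):
    """With symmetric error rate err, a read shows the alternate allele with
    probability q = p*(1 - err) + (1 - p)*err, so q must lie in [err, 1 - err]."""
    return min(max(q, err), 1.0 - err)


def pooled_lrt(case_alt, case_total, ctrl_alt, ctrl_total, err=0.01):
    """Return (statistic, approximate p-value) for a difference in allele
    frequency between two pools. Hypothetical helper, not the paper's method."""
    if not 0.0 < err < 0.5:
        raise ValueError("err must be strictly between 0 and 0.5")
    if case_total <= 0 or ctrl_total <= 0:
        raise ValueError("both pools need at least one read")

    # Null hypothesis: one shared read-level frequency for both pools.
    q_null = _clip((case_alt + ctrl_alt) / (case_total + ctrl_total), err)
    # Alternative: each pool has its own read-level frequency.
    q_case = _clip(case_alt / case_total, err)
    q_ctrl = _clip(ctrl_alt / ctrl_total, err)

    ll_null = (_log_lik(case_alt, case_total, q_null)
               + _log_lik(ctrl_alt, ctrl_total, q_null))
    ll_alt = (_log_lik(case_alt, case_total, q_case)
              + _log_lik(ctrl_alt, ctrl_total, q_ctrl))

    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Made-up read counts, for demonstration only.
    print(pooled_lrt(100, 1000, 50, 1000))
```

The sketch treats every read as an independent draw, so it ignores the pooling variance that the abstract says was evaluated empirically; a realistic design calculation would need to model that as well.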
PMCID: PMC5001557  PMID: 20552648
2.  Genomic Analyses Reveal Demographic History and Temperate Adaptation of the Newly Discovered Honey Bee Subspecies Apis mellifera sinisxinyuan n. ssp 
Molecular Biology and Evolution  2016;33(5):1337-1348.
Studying the genetic signatures of climate-driven selection can produce insights into local adaptation and the potential impacts of climate change on populations. The honey bee (Apis mellifera) is an interesting species to study local adaptation because it originated in tropical/subtropical climatic regions and subsequently spread into temperate regions. However, little is known about the genetic basis of its adaptation to temperate climates. Here, we resequenced the whole genomes of ten individual bees from a newly discovered population in temperate China and downloaded resequenced data from 35 individuals from other populations. We found that the new population is an undescribed subspecies in the M-lineage of A. mellifera (Apis mellifera sinisxinyuan). Analyses of population history show that long-term global temperature has strongly influenced the demographic history of A. m. sinisxinyuan and its divergence from other subspecies. Further analyses comparing temperate and tropical populations identified several candidate genes related to fat body and the Hippo signaling pathway that are potentially involved in adaptation to temperate climates. Our results provide insights into the demographic history of the newly discovered A. m. sinisxinyuan, as well as the genetic basis of adaptation of A. mellifera to temperate climates at the genomic level. These findings will facilitate the selective breeding of A. mellifera to improve the survival of overwintering colonies.
PMCID: PMC4839221  PMID: 26823447
Apis mellifera sinisxinyuan; new subspecies; temperate climates; population history; adaptation
3.  Characterization of Genomic Variants Associated with Scout and Recruit Behavioral Castes in Honey Bees Using Whole-Genome Sequencing 
PLoS ONE  2016;11(1):e0146430.
Among forager honey bees, scouts seek new resources and return to the colony, enlisting recruits to collect these resources. Differentially expressed genes between these behaviors and genetic variability in scouting phenotypes have been reported. Whole-genome sequencing of 44 Apis mellifera scouts and recruits was undertaken to detect variants and further understand the genetic architecture underlying the behavioral differences between scouts and recruits. The median coverage depth in recruits and scouts was 10.01 and 10.7 X, respectively. Representation of bacterial species among the unmapped reads reflected a more diverse microbiome in scouts than recruits. Overall, 1,412,705 polymorphic positions were analyzed for associations with scouting behavior, and 212 significant (p-value < 0.0001) associations with scouting corresponding to 137 positions were detected. Most frequent putative transcription factor binding sites proximal to significant variants included Broad-complex 4, Broad-complex 1, Hunchback, and CF2-II. Three variants associated with scouting were located within coding regions of ncRNAs including one codon change (LOC102653644) and 2 frameshift indels (LOC102654879 and LOC102655256). Significant variants were also identified on the 5’UTR of membrin, and 3’UTRs of laccase 2 and diacylglycerol kinase theta. The 60 significant variants located within introns corresponded to 39 genes and most of these positions were > 1000 bp apart from each other. A number of these variants were mapped to ncRNA LOC100578102, solute carrier family 12 member 6-like gene, and LOC100576965 (meprin and TRAF-C homology domain containing gene). Functional categories represented among the genes corresponding to significant variants included: neuronal function, exoskeleton, immune response, salivary gland development, and enzymatic food processing. These categories offer a glimpse into the molecular support to the behaviors of scouts and recruits. 
The level of association between genomic variants and scouting behavior observed in this study may be linked to the honey bee’s genomic plasticity and fluidity of transition between castes.
PMCID: PMC4718678  PMID: 26784945
4.  Genetic responses to seasonal variation in altitudinal stress: whole-genome resequencing of great tit in eastern Himalayas 
Scientific Reports  2015;5:14256.
Species that undertake altitudinal migrations are exposed to a considerable seasonal variation in oxygen levels and temperature. How they cope with this was studied in a population of great tit (Parus major) that breeds at high elevations and winters at lower elevations in the eastern Himalayas. Comparison of population genomics of high altitudinal great tits and those living in lowlands revealed an accelerated genetic selection for carbohydrate energy metabolism (amino sugar, nucleotide sugar metabolism and insulin signaling pathways) and hypoxia response (PI3K-akt, mTOR and MAPK signaling pathways) in the high altitudinal population. The PI3K-akt, mTOR and MAPK pathways modulate the hypoxia-inducible factors, HIF-1α and VEGF protein expression, and thus indirectly regulate hypoxia-induced angiogenesis, erythropoiesis and vasodilatation. The strategies observed in high altitudinal great tits differ from those described in a closely related species on the Tibetan Plateau, the sedentary ground tit (Parus humilis). This species has enhanced selection in lipid-specific metabolic pathways and hypoxia-inducible factor pathway (HIF-1). Comparative population genomics also revealed selection for larger body size in high altitudinal great tits.
PMCID: PMC4585896  PMID: 26404527
5.  The miRNA Transcriptome Directly Reflects the Physiological and Biochemical Differences between Red, White, and Intermediate Muscle Fiber Types 
MicroRNAs (miRNAs) are small non-coding RNAs that can regulate their target genes at the post-transcriptional level. Skeletal muscle comprises different fiber types that can be broadly classified as red, intermediate, and white. Recently, a set of miRNAs was found expressed in a fiber type-specific manner in red and white fiber types. However, an in-depth analysis of the miRNA transcriptome differences between all three fiber types has not been undertaken. Herein, we collected 15 porcine skeletal muscles from different anatomical locations, which were then clearly divided into red, white, and intermediate fiber type based on the ratios of myosin heavy chain isoforms. We further illustrated that three muscles, which typically represented each muscle fiber type (i.e., red: peroneal longus (PL), intermediate: psoas major muscle (PMM), white: longissimus dorsi muscle (LDM)), have distinct metabolic patterns of mitochondrial and glycolytic enzyme levels. Furthermore, we constructed small RNA libraries for PL, PMM, and LDM using a deep sequencing approach. Results showed that the differentially expressed miRNAs were mainly enriched in PL and played a vital role in myogenesis and energy metabolism. Overall, this comprehensive analysis will contribute to a better understanding of the miRNA regulatory mechanism that achieves the phenotypic diversity of skeletal muscles.
PMCID: PMC4463610  PMID: 25938964
miRNA; fiber type; pig; myogenesis; energy metabolism
6.  MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC) 
BMC Bioinformatics  2015;16(Suppl 7):S10.
Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers like Tianhe-2, currently top of TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores).
To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM).
MICA can be readily used by MIC-enabled supercomputers for production purpose. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development.
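The figures quoted in this abstract can be combined into two simple derived quantities: how close the three-card speedup comes to linear scaling, and the approximate alignment throughput per node on Tianhe-2. The sketch below does only that arithmetic. The function and variable names are illustrative, and treating "in an hour" as exactly one hour is an assumption.

```python
def scaling_efficiency(speedup_n_cards, speedup_one_card, n_cards):
    """Fraction of ideal linear scaling achieved when going from 1 to n_cards."""
    return speedup_n_cards / (speedup_one_card * n_cards)


# Speedups over BWA-MEM (6 CPU cores) as quoted in the abstract.
speedup_one_card = 4.9
speedup_three_cards = 14.1
efficiency = scaling_efficiency(speedup_three_cards, speedup_one_card, 3)

# Tianhe-2 test as quoted: 17.47 tera-bases on 400 nodes in about an hour.
terabases = 17.47
nodes = 400
hours = 1.0  # assumption: "in an hour" taken as exactly one hour
gigabases_per_node_hour = terabases * 1000 / (nodes * hours)

if __name__ == "__main__":
    print(f"scaling efficiency with 3 cards: {efficiency:.1%}")
    print(f"throughput: {gigabases_per_node_hour:.1f} Gb per node-hour")
```

Under these assumptions the three-card run reaches roughly 96% of ideal linear scaling, and each node aligns on the order of 44 gigabases per hour.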
Availability and implementation
MICA's source code is freely available at under GPL v3.
Supplementary information
Supplementary information is available as "Additional File 1". Datasets are available at
PMCID: PMC4423751  PMID: 25952019
7.  The general amino acid control pathway regulates mTOR and autophagy during serum/glutamine starvation 
The Journal of Cell Biology  2014;206(2):173-182.
To meet their metabolic needs, starved cells first activate autophagy, but activation in parallel of the general amino acid control pathway increases amino acid uptake, leading to reactivation of mTOR and down-regulation of autophagy.
Organisms have evolved elaborate mechanisms to adjust intracellular nutrient levels in response to fluctuating availability of exogenous nutrients. During starvation, cells can enhance amino acid uptake and synthesis through the general amino acid control (GAAC) pathway, whereas nonessential cellular contents are recycled by autophagy. How these two pathways are coordinated in response to starvation is currently unknown. Here we show that the GAAC pathway couples exogenous amino acid availability with autophagy. Starvation caused deactivation of mTOR, which then activated autophagy. In parallel, serum/glutamine starvation activated the GAAC pathway, which up-regulated amino acid transporters, leading to increased amino acid uptake. This elevated the intracellular amino acid level, which in turn reactivated mTOR and suppressed autophagy. Knockdown of activating transcription factor 4, the major transcription factor in the GAAC pathway, or of SLC7A5, a leucine transporter, caused impaired mTOR reactivation and much higher levels of autophagy. Thus, the GAAC pathway modulates autophagy by regulating amino acid uptake and mTOR reactivation during serum/glutamine starvation.
PMCID: PMC4107793  PMID: 25049270
Cancer cells derived from different stages of tumor progression may exhibit distinct biological properties, as exemplified by the paired lung cancer cell lines H1993 and H2073. While H1993 was derived from a chemo-naive metastasized tumor, H2073 originated from the chemo-resistant primary tumor from the same patient and exhibits a strikingly different drug response profile. To understand the underlying genetic and epigenetic bases for their biological properties, we investigated these cells using a wide range of large-scale methods including whole genome sequencing, RNA sequencing, SNP array, DNA methylation array, and de novo genome assembly. We conducted an integrative analysis of both cell lines to distinguish between potential driver and passenger alterations. Although many genes are mutated in these cell lines, the combination of DNA- and RNA-based variant information strongly implicates a small number of genes including TP53 and STK11 as likely drivers. Likewise, we found a diverse set of genes differentially expressed between these cell lines, but only a fraction can be attributed to changes in DNA copy number or methylation. This set included the ABC transporter ABCC4, implicated in drug resistance, and the metastasis-associated MET oncogene. While the rich data content allowed us to reduce the space of hypotheses that could explain most of the observed biological properties, we also caution that there is a lack of statistical power and that there are inherent limitations in such single-patient case studies.
PMCID: PMC3940063  PMID: 24297535
9.  Whole-genome sequencing of Berkshire (European native pig) provides insights into its origin and domestication 
Scientific Reports  2014;4:4678.
Domesticated organisms have experienced strong selective pressures directed at genes or genomic regions controlling traits of biological, agricultural or medical importance. The genomes of native and domesticated pigs provide a unique opportunity for tracing the history of domestication and identifying signatures of artificial selection. Here we used whole-genome sequencing to explore the genetic relationships among the European native pig Berkshire and breeds that are distributed worldwide, and to identify genomic footprints left by selection during the domestication of Berkshire. Numerous genes containing nonsynonymous SNPs fall into olfactory-related categories, which are part of a rapidly evolving superfamily in the mammalian genome. Phylogenetic analyses revealed a deep phylogenetic split between European and Asian pigs rather than between domestic and wild pigs. Admixture analysis showed a higher proportion of Chinese genetic material in the Berkshire pigs, which is consistent with the historical record regarding its origin. Selective sweep analyses revealed strong signatures of selection affecting genomic regions that harbor genes underlying economic traits such as disease resistance, pork yield, fertility, tameness and body length. These discoveries confirm the history of origin of the Berkshire pig by genome-wide analysis and illustrate how domestication has shaped the patterns of genetic variation.
PMCID: PMC3985078  PMID: 24728479
10.  Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx) 
Science (New York, N.Y.)  2009;326(5951):433-436.
A single–base pair resolution silkworm genetic variation map was constructed from 40 domesticated and wild silkworms, each sequenced to approximately threefold coverage, representing 99.88% of the genome. We identified ∼16 million single-nucleotide polymorphisms, many indels, and structural variations. We find that the domesticated silkworms are clearly genetically differentiated from the wild ones, but they have maintained large levels of genetic variability, suggesting a short domestication event involving a large number of individuals. We also identified signals of selection at 354 candidate genes that may have been important during domestication, some of which have enriched expression in the silk gland, midgut, and testis. These data add to our understanding of the domestication processes and may have applications in devising pest control strategies and advancing the use of silkworms as efficient bioreactors.
PMCID: PMC3951477  PMID: 19713493
11.  The sequence and de novo assembly of the giant panda genome 
Li, Ruiqiang | Fan, Wei | Tian, Geng | Zhu, Hongmei | He, Lin | Cai, Jing | Huang, Quanfei | Cai, Qingle | Li, Bo | Bai, Yinqi | Zhang, Zhihe | Zhang, Yaping | Wang, Wen | Li, Jun | Wei, Fuwen | Li, Heng | Jian, Min | Li, Jianwen | Zhang, Zhaolei | Nielsen, Rasmus | Li, Dawei | Gu, Wanjun | Yang, Zhentao | Xuan, Zhaoling | Ryder, Oliver A. | Leung, Frederick Chi-Ching | Zhou, Yan | Cao, Jianjun | Sun, Xiao | Fu, Yonggui | Fang, Xiaodong | Guo, Xiaosen | Wang, Bo | Hou, Rong | Shen, Fujun | Mu, Bo | Ni, Peixiang | Lin, Runmao | Qian, Wubin | Wang, Guodong | Yu, Chang | Nie, Wenhui | Wang, Jinhuan | Wu, Zhigang | Liang, Huiqing | Min, Jiumeng | Wu, Qi | Cheng, Shifeng | Ruan, Jue | Wang, Mingwei | Shi, Zhongbin | Wen, Ming | Liu, Binghang | Ren, Xiaoli | Zheng, Huisong | Dong, Dong | Cook, Kathleen | Shan, Gao | Zhang, Hao | Kosiol, Carolin | Xie, Xueying | Lu, Zuhong | Zheng, Hancheng | Li, Yingrui | Steiner, Cynthia C. | Lam, Tommy Tsan-Yuk | Lin, Siyuan | Zhang, Qinghui | Li, Guoqing | Tian, Jing | Gong, Timing | Liu, Hongde | Zhang, Dejin | Fang, Lin | Ye, Chen | Zhang, Juanbin | Hu, Wenbo | Xu, Anlong | Ren, Yuanyuan | Zhang, Guojie | Bruford, Michael W. | Li, Qibin | Ma, Lijia | Guo, Yiran | An, Na | Hu, Yujie | Zheng, Yang | Shi, Yongyong | Li, Zhiqiang | Liu, Qing | Chen, Yanling | Zhao, Jing | Qu, Ning | Zhao, Shancen | Tian, Feng | Wang, Xiaoling | Wang, Haiyin | Xu, Lizhi | Liu, Xiao | Vinar, Tomas | Wang, Yajun | Lam, Tak-Wah | Yiu, Siu-Ming | Liu, Shiping | Zhang, Hemin | Li, Desheng | Huang, Yan | Wang, Xia | Yang, Guohua | Jiang, Zhi | Wang, Junyi | Qin, Nan | Li, Li | Li, Jingxiang | Bolund, Lars | Kristiansen, Karsten | Wong, Gane Ka-Shu | Olson, Maynard | Zhang, Xiuqing | Li, Songgang | Yang, Huanming | Wang, Jian | Wang, Jun
Nature  2009;463(7279):311-317.
Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.
PMCID: PMC3951497  PMID: 20010809
12.  Whole-genome analysis of 5-hydroxymethylcytosine and 5-methylcytosine at base resolution in the human brain 
Genome Biology  2014;15(3):R49.
5-methylcytosine (mC) can be oxidized by the tet methylcytosine dioxygenase (Tet) family of enzymes to 5-hydroxymethylcytosine (hmC), which is an intermediate of mC demethylation and may also be a stable epigenetic modification that influences chromatin structure. hmC is particularly abundant in mammalian brains but its function is currently unknown. A high-resolution hydroxymethylome map is required to fully understand the function of hmC in the human brain.
We present genome-wide and single-base resolution maps of hmC and mC in the human brain by combined application of Tet-assisted bisulfite sequencing and bisulfite sequencing. We demonstrate that hmCs increase markedly from the fetal to the adult stage, and in the adult brain, 13% of all CpGs are highly hydroxymethylated with strong enrichment at genic regions and distal regulatory elements. Notably, hmC peaks are identified at the 5′splicing sites at the exon-intron boundary, suggesting a mechanistic link between hmC and splicing. We report a surprising transcription-correlated hmC bias toward the sense strand and an mC bias toward the antisense strand of gene bodies. Furthermore, hmC is negatively correlated with H3K27me3-marked and H3K9me3-marked repressive genomic regions, and is more enriched at poised enhancers than active enhancers.
We provide single-base resolution hmC and mC maps in the human brain and our data imply novel roles of hmC in regulating splicing and gene expression. Hydroxymethylation is the main modification status for a large portion of CpGs situated at poised enhancers and actively transcribed regions, suggesting its roles in epigenetic tuning at these regions.
PMCID: PMC4053808  PMID: 24594098
13.  A human gut microbial gene catalog established by metagenomic sequencing 
Nature  2010;464(7285):59-65.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
PMCID: PMC3779803  PMID: 20203603
14.  Correction: SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner 
PLoS ONE  2013;8(8):10.1371/annotation/823f3670-ed17-41ec-ba51-b50281651915.
PMCID: PMC3750100
15.  Sequencing of Fifty Human Exomes Reveals Adaptation to High Altitude 
Science (New York, N.Y.)  2010;329(5987):75-78.
Residents of the Tibetan Plateau show heritable adaptations to extreme altitude. We sequenced 50 exomes of ethnic Tibetans, encompassing coding sequences of 92% of human genes, with an average coverage of 18X per individual. Genes showing population-specific allele frequency changes, which represent strong candidates for altitude adaptation, were identified. The strongest signal of natural selection came from EPAS1, a transcription factor involved in response to hypoxia. One SNP at EPAS1 shows a 78% frequency difference between Tibetan and Han samples, representing the fastest allele frequency change observed at any human gene to date. This SNP’s association with erythrocyte abundance supports the role of EPAS1 in adaptation to hypoxia. Thus, a population genomic survey has revealed a functionally important locus in genetic adaptation to high altitude.
PMCID: PMC3711608  PMID: 20595611
16.  Probing Meiotic Recombination and Aneuploidy of Single Sperm Cells by Whole Genome Sequencing 
Science (New York, N.Y.)  2012;338(6114):1627-1630.
Meiotic recombination creates genetic diversity and ensures segregation of homologous chromosomes. Previous population analyses yielded results averaged among individuals and impacted by evolutionary pressures. Here we sequenced 99 sperm from an Asian male using the newly developed amplification method—Multiple Annealing and Looping-Based Amplification Cycles (MALBAC)—to phase the personal genome and map at high resolution recombination events, which are non-uniformly distributed across the genome in the absence of selection pressure. The paucity of recombination near transcription start sites observed in individual sperm indicates such a phenomenon is intrinsic to the molecular mechanism of meiosis. Interestingly, a decreased crossover frequency in companion with an increase of autosomal aneuploidy is observable on a global per-sperm basis.
PMCID: PMC3590491  PMID: 23258895
17.  SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner 
PLoS ONE  2013;8(5):e65632.
To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Real data evaluation using human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.
PMCID: PMC3669295  PMID: 23741504
18.  Mitochondrial DNA Evidence Indicates the Local Origin of Domestic Pigs in the Upstream Region of the Yangtze River 
PLoS ONE  2012;7(12):e51649.
Previous studies have indicated two main domestic pig dispersal routes in East Asia: one is from the Mekong region, through the upstream region of the Yangtze River (URYZ) to the middle and upstream regions of the Yellow River; the other is from the middle and downstream regions of the Yangtze River to the downstream region of the Yellow River, and then to northeast China. The URYZ was regarded as a passageway of the former dispersal route; however, this assumption remains to be further investigated. We therefore analyzed the hypervariable segments of mitochondrial DNA from 513 individual pigs mainly from Sichuan and the Tibet highlands and 1,394 publicly available sequences from domestic pigs and wild boars across Asia. From the phylogenetic tree, most of the samples fell into a mixed group that was difficult to distinguish by breed or geography. The total network analysis showed that the URYZ pigs possessed a dominant position in haplogroup A and domestic pigs shared the same core haplotype with the local wild boars, suggesting that pigs in group A were most likely derived from the URYZ pool. In addition, a region-wise network analysis determined that URYZ contains 42 haplotypes, of which 22 are unique, indicating the high diversity in this region. In conclusion, our findings confirmed that pigs from the URYZ were domesticated in situ.
PMCID: PMC3521662  PMID: 23272130
19.  An atlas of DNA methylomes in porcine adipose and muscle tissues 
Nature communications  2012;3:850.
It is evident that epigenetic factors, especially DNA methylation, play essential roles in obesity development. Using pig as a model, here we investigated the systematic association between DNA methylation and obesity. We sampled eight variant adipose and two distinct skeletal muscle tissues from three pig breeds living within comparable environments but displaying distinct fat levels. We generated 1,381 gigabases (Gb) of sequence data from 180 methylated DNA immunoprecipitation (MeDIP) libraries, and provided a genome-wide DNA methylation map as well as a gene expression map for adipose and muscle studies. The analysis showed global similarities and differences among breeds, sexes and anatomic locations, and identified the differentially methylated regions (DMRs). The DMRs in promoters are highly associated with obesity development via expression repression of both known obesity-related genes and novel genes. This comprehensive map provides a solid basis for exploring epigenetic mechanisms of adipose deposition and muscle growth.
PMCID: PMC3508711  PMID: 22617290
20.  Co-methylated Genes in Different Adipose Depots of Pig are Associated with Metabolic, Inflammatory and Immune Processes 
It is well established that the metabolic risk factors of obesity and its comorbidities are attributed more to adipose tissue distribution than to total adipose mass. Since emerging evidence suggests that epigenetic regulation plays an important role in the aetiology of obesity, we conducted a genome-wide methylation analysis on eight different adipose depots of three pig breeds living within comparable environments but displaying distinct fat levels using methylated DNA immunoprecipitation sequencing. We aimed to investigate the systematic association between anatomical location-specific DNA methylation status of different adipose depots and obesity-related phenotypes. We show here that compared to subcutaneous adipose tissues which primarily modulate metabolic indicators, visceral adipose tissues and intermuscular adipose tissue, which are the metabolic risk factors of obesity, are primarily associated with impaired inflammatory and immune responses. This study presents epigenetic evidence for functionally relevant methylation differences between different adipose depots.
PMCID: PMC3372887  PMID: 22719223
pig; subcutaneous adipose tissue; visceral adipose tissue; DNA methylation; MeDIP-seq
21.  Mapping copy number variation by population scale genome sequencing 
Nature  2011;470(7332):59-65.
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
PMCID: PMC3077050  PMID: 21293372
22.  SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from RNA-Seq Data 
RNA-Seq, a method using next generation sequencing technologies to sequence the transcriptome, facilitates genome-wide analysis of splice junction sites. In this paper, we introduce SOAPsplice, a robust tool to detect splice junctions using RNA-Seq data without using any information of known splice junctions. SOAPsplice uses a novel two-step approach consisting of first identifying as many reasonable splice junction candidates as possible, and then, filtering the false positives with two effective filtering strategies. In both simulated and real datasets, SOAPsplice is able to detect many reliable splice junctions with low false positive rate. The improvement gained by SOAPsplice, when compared to other existing tools, becomes more obvious when the depth of sequencing is low. SOAPsplice is freely available at
PMCID: PMC3268599  PMID: 22303342
RNA-Seq; splice junction; spliced alignment
23.  Exome Sequencing Identifies ZNF644 Mutations in High Myopia 
PLoS Genetics  2011;7(6):e1002084.
Myopia is the most common ocular disorder worldwide, and high myopia in particular is one of the leading causes of blindness. Genetic factors play a critical role in the development of myopia, especially high myopia. Recently, the exome sequencing approach has been successfully used for disease gene identification in Mendelian disorders. Here we show a successful application of exome sequencing to identify a gene for an autosomal dominant disorder, and we have identified a gene potentially responsible for high myopia in a monogenic form. We captured exomes of two affected individuals from a Han Chinese family with high myopia and performed sequencing analysis by a second-generation sequencer with a mean coverage of 30× and sufficient depth to call variants at ∼97% of each targeted exome. The shared genetic variants of these two affected individuals in the family being studied were filtered against the 1000 Genomes Project and the dbSNP131 database. A mutation A672G in zinc finger protein 644 isoform 1 (ZNF644) was identified as being related to the phenotype of this family. After we performed sequencing analysis of the exons in the ZNF644 gene in 300 sporadic cases of high myopia, we identified an additional five mutations (I587V, R680G, C699Y, 3′UTR+12 C>G, and 3′UTR+592 G>A) in 11 different patients. All these mutations were absent in 600 normal controls. The ZNF644 gene was expressed in human retina and retinal pigment epithelium (RPE). Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, mutation may cause the axial elongation of the eyeball found in high myopia patients. Our results suggest that ZNF644 might be a causal gene for high myopia in a monogenic form.
Author Summary
People with myopia see near objects more clearly than objects far away. Myopia is the most common ocular disorder worldwide, with a high prevalence in Asian (40%–70%) and Caucasian (20%–30%) populations. Although the etiologies of myopia have not yet been established, previous studies have indicated the involvement of genetic and environmental factors (such as close working habits, higher education levels, and higher socioeconomic class). Genetic factors play a critical role in the development of myopia, especially high myopia. In this study, we use exome sequencing, a powerful tool for disease gene identification, to identify a gene involved in high myopia in a monogenic form among Han Chinese. Mutations in zinc finger protein 644 isoform 1 (ZNF644) were identified as potentially responsible for the phenotype of high myopia. The main feature of high myopia is axial elongation of the eye globe. Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, a mutant ZNF644 protein may impact normal eye development and therefore may underlie the axial elongation of the eye globe in high myopia patients. Further study of the biological function of ZNF644 will provide insight into the pathogenesis of myopia.
PMCID: PMC3111487  PMID: 21695231
24.  Evolutionary Transients in the Rice Transcriptome 
In the canonical version of evolution by gene duplication, one copy is kept unaltered while the other is free to evolve. This process of evolutionary experimentation can persist for millions of years. Since it is so short lived in comparison to the lifetime of the core genes that make up the majority of most genomes, a substantial fraction of the genome and the transcriptome may, in principle, be attributable to what we will refer to as “evolutionary transients”, referring here to both the process and the genes that have gone through or are undergoing this process. Using the rice gene set as a test case, we argue that this phenomenon goes a long way towards explaining why there are so many more rice genes than Arabidopsis genes, and why most excess rice genes show low similarity to eudicots.
PMCID: PMC5054128  PMID: 21382590
evolutionary transients; rice; gene duplication
25.  The DNA Methylome of Human Peripheral Blood Mononuclear Cells 
PLoS Biology  2010;8(11):e1000533.
Analysis across the genome of patterns of DNA methylation reveals a rich landscape of allele-specific epigenetic modification and consequent effects on allele-specific gene expression.
DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies.
Author Summary
Epigenetic modifications such as addition of methyl groups to cytosine in DNA play a role in regulating gene expression. To better understand these processes, knowledge of the methylation status of all cytosine bases in the genome (the methylome) is required. DNA methylation can differ between the two gene copies (alleles) in each cell. Such allele-specific methylation (ASM) can be due to parental origin of the alleles (imprinting), X chromosome inactivation in females, and other as yet unknown mechanisms. This may significantly alter the expression profile arising from different allele combinations in different individuals. Using advanced sequencing technology, we have determined the methylome of human peripheral blood mononuclear cells (PBMC). Importantly, the PBMC were obtained from the same male Han Chinese individual whose complete genome had previously been determined. This allowed us, for the first time, to study genome-wide differences in ASM. Our analysis shows that ASM in PBMC is higher than can be accounted for by regions known to undergo parent-of-origin imprinting and frequently (>80%) correlates with allele-specific expression (ASE) of the corresponding gene. In addition, our data reveal a rich landscape of epigenomic variation for 20 genomic features, including regulatory, coding, and non-coding sequences, and provide a valuable resource for future studies. Our work further establishes whole-genome sequencing as an efficient method for methylome analysis.
PMCID: PMC2976721  PMID: 21085693

