Results 1-19 (19)

1.  Integrated RNA-seq and sRNA-seq analysis identifies novel nitrate-responsive genes in Arabidopsis thaliana roots 
BMC Genomics  2013;14:701.
Nitrate and other nitrogen metabolites can act as signals that regulate global gene expression in plants. Adaptive changes in plant morphology and physiology triggered by changes in nitrate availability are partly explained by these changes in gene expression. Despite several genome-wide efforts to identify nitrate-regulated genes, no comprehensive study of the Arabidopsis root transcriptome under contrasting nitrate conditions has been carried out.
In this work, we employed the Illumina high throughput sequencing technology to perform an integrated analysis of the poly-A + enriched and the small RNA fractions of the Arabidopsis thaliana root transcriptome in response to nitrate treatments. Our sequencing strategy identified new nitrate-regulated genes including 40 genes not represented in the ATH1 Affymetrix GeneChip, a novel nitrate-responsive antisense transcript and a new nitrate responsive miRNA/TARGET module consisting of a novel microRNA, miR5640 and its target, AtPPC3.
Sequencing of small RNAs and mRNAs uncovered new genes, and enabled us to develop new hypotheses for nitrate regulation and coordination of carbon and nitrogen metabolism.
PMCID: PMC3906980  PMID: 24119003
Arabidopsis; Nitrate; RNA-seq; Roots; MicroRNA; Transcriptomics
2.  Hybrid error correction and de novo assembly of single-molecule sequencing reads 
Nature biotechnology  2012;30(7):693-700.
Emerging single-molecule sequencing instruments can generate multi-kilobase sequences with the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of single-molecule reads is challenging, and has limited their use to resequencing bacteria. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on PacBio RS reads of phage, prokaryotic, and eukaryotic whole genomes, including the novel genome of the parrot Melopsittacus undulatus, as well as for RNA-seq reads of the corn (Zea mays) transcriptome. Our approach achieves over 99.9% read correction accuracy and produces substantially better assemblies than current sequencing strategies: in the best example, quintupling the median contig size relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
PMCID: PMC3707490  PMID: 22750884
3.  Parallel comparison of Illumina RNA-Seq and Affymetrix microarray platforms on transcriptomic profiles generated from 5-aza-deoxy-cytidine treated HT-29 colon cancer cells and simulated datasets 
BMC Bioinformatics  2013;14(Suppl 9):S1.
High-throughput parallel sequencing, RNA-Seq, has recently emerged as an appealing alternative to microarrays for identifying differentially expressed genes (DEG) between biological groups. However, there is still considerable discrepancy in gene expression measurements and DEG results between the two platforms. The objective of this study was to compare parallel paired-end RNA-Seq and microarray data generated on 5-azadeoxy-cytidine (5-Aza) treated HT-29 colon cancer cells with an additional simulation study.
We first performed general correlation analysis comparing gene expression profiles on both platforms. An Errors-In-Variables (EIV) regression model was subsequently applied to assess proportional and fixed biases between the two technologies. Then several existing algorithms, designed for DEG identification in RNA-Seq and microarray data, were applied to compare the cross-platform overlaps with respect to DEG lists, which were further validated using qRT-PCR assays on selected genes. Functional analyses were subsequently conducted using Ingenuity Pathway Analysis (IPA).
Pearson and Spearman correlation coefficients between the RNA-Seq and microarray data each exceeded 0.80, with 66%-68% overlap of genes on both platforms. The EIV regression model indicated the existence of both fixed and proportional biases between the two platforms. The DESeq and baySeq algorithms (RNA-Seq) and the SAM and eBayes algorithms (microarray) achieved the highest cross-platform overlap rate in DEG results from both experimental and simulated datasets. The DESeq method exhibited better control of the false discovery rate than baySeq on the simulated dataset, although it was slightly inferior to baySeq in the sensitivity test. RNA-Seq and qRT-PCR, but not microarray data, confirmed the expected reversal of SPARC gene suppression after treating HT-29 cells with 5-Aza. Thirty-three IPA canonical pathways were identified by both microarray and RNA-Seq data, 152 pathways by RNA-Seq data only, and none by microarray data only.
These results suggest that RNA-Seq has advantages over microarray in identification of DEGs with the most consistent results generated from DESeq and SAM methods. The EIV regression model reveals both fixed and proportional biases between RNA-Seq and microarray. This may explain in part the lower cross-platform overlap in DEG lists compared to those in detectable genes.
PMCID: PMC3697991  PMID: 23902433
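The cross-platform correlation analysis described in entry 3 can be illustrated with a minimal sketch. This is not the study's code: the function names and the toy expression values are assumptions made for illustration, and only the standard definitions of the Pearson and Spearman coefficients are used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical log-expression values for five genes on each platform.
rnaseq = [1.0, 2.5, 3.1, 4.8, 6.0]
microarray = [1.2, 2.0, 3.5, 4.1, 6.3]
print(pearson(rnaseq, microarray), spearman(rnaseq, microarray))
```

Spearman depends only on rank order, so it is less sensitive than Pearson to the scale differences between sequencing counts and array intensities.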
4.  Current challenges in de novo plant genome sequencing and assembly 
Genome Biology  2012;13(4):243.
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.
PMCID: PMC3446297  PMID: 22546054
DNA sequencing; genome assembly; plant genomics
5.  Chd5 Requires PHD-mediated Histone 3 Binding for Tumor Suppression 
Cell reports  2013;3(1):92-102.
Chromodomain Helicase DNA-binding protein 5 (CHD5) is a tumor suppressor mapping to 1p36, a genomic region frequently deleted in human cancer. Although CHD5 belongs to the CHD family of chromatin remodeling proteins, whether its tumor suppressive role involves an interaction with chromatin is unknown. Here we report that Chd5 binds the unmodified N-terminus of H3 through its tandem plant homeodomains (PHDs). Genome-wide ChIP studies reveal preferential binding of Chd5 to loci lacking the active mark H3K4me3, and also identify novel Chd5-targets implicated in cancer. Chd5 mutations abrogating H3 binding are unable to inhibit proliferation or to transcriptionally modulate target genes, leading to tumorigenesis in vivo. Unlike wild-type Chd5, Chd5-PHD mutants are unable to induce differentiation or to efficiently suppress growth of human neuroblastoma in vivo. Our work defines Chd5 as an N-terminally unmodified H3-binding protein and provides functional evidence that this interaction orchestrates chromatin-mediated transcriptional programs critical for tumor suppression.
PMCID: PMC3575599  PMID: 23318260
6.  A Hybrid Likelihood Model for Sequence-Based Disease Association Studies 
PLoS Genetics  2013;9(1):e1003224.
In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values < 0.05, of which one MAPK signaling pathway (KEGG) reaches trend-level significance after correcting for multiple testing.
Author Summary
Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
PMCID: PMC3554549  PMID: 23358228
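The burden-style collapsing idea mentioned in entry 6 can be sketched as follows. This shows only the generic collapsing step followed by a one-sided Fisher's exact test. It is not the hybrid likelihood model the paper proposes, and the function names and toy genotypes are hypothetical.

```python
from math import comb

def collapse(genotypes):
    """Collapse each individual's rare-variant genotypes in a region
    to a single carrier indicator (1 = carries at least one variant)."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]].
    Returns P(case carriers >= a) under the hypergeometric null."""
    n_cases = a + b
    n_carriers = a + c
    n_total = a + b + c + d
    upper = min(n_cases, n_carriers)
    favorable = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return favorable / comb(n_total, n_carriers)

def burden_test(case_genotypes, control_genotypes):
    """Collapse both groups, then test for carrier excess in cases."""
    cases = collapse(case_genotypes)
    controls = collapse(control_genotypes)
    a, c = sum(cases), sum(controls)
    return fisher_one_sided(a, len(cases) - a, c, len(controls) - c)

# Toy data: rows are individuals, columns are rare variant sites,
# entries are minor-allele counts (0, 1, or 2).
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 2], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(burden_test(cases, controls))
```

Collapsing discards where in the gene each variant falls, which is the positional information the abstract says the hybrid model adds back.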
7.  SpliceTrap: a method to quantify alternative splicing under single cellular conditions 
Bioinformatics  2011;27(21):3010-3016.
Motivation: Alternative splicing (AS) is a pre-mRNA maturation process leading to the expression of multiple mRNA variants from the same primary transcript. More than 90% of human genes are expressed via AS. Therefore, quantifying the inclusion level of every exon is crucial for generating accurate transcriptomic maps and studying the regulation of AS.
Results: Here we introduce SpliceTrap, a method to quantify exon inclusion levels using paired-end RNA-seq data. Unlike other tools, which focus on full-length transcript isoforms, SpliceTrap approaches the expression-level estimation of each exon as an independent Bayesian inference problem. In addition, SpliceTrap can identify major classes of alternative splicing events under a single cellular condition, without requiring a background set of reads to estimate relative splicing changes. We tested SpliceTrap both by simulation and real data analysis, and compared it to state-of-the-art tools for transcript quantification. SpliceTrap demonstrated improved accuracy, robustness and reliability in quantifying exon-inclusion ratios.
Conclusions: SpliceTrap is a useful tool to study alternative splicing regulation, especially for accurate quantification of local exon-inclusion ratios from RNA-seq data.
Availability and Implementation: SpliceTrap can be implemented online through the CSH Galaxy server and is also available for download and installation at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3198574  PMID: 21896509
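To make the notion of an exon-inclusion ratio in entry 7 concrete, here is a minimal sketch of a naive junction-read estimate. It is not SpliceTrap's Bayesian model. The function name, its parameters, and the convention of averaging the two inclusion junctions are assumptions made for illustration.

```python
def inclusion_ratio(incl_upstream, incl_downstream, skipping):
    """Naive exon-inclusion ratio from junction read counts.

    incl_upstream / incl_downstream: reads spanning the two junctions
    that join the cassette exon to its flanking exons.
    skipping: reads spanning the junction that joins the flanking
    exons directly, bypassing the cassette exon.

    Inclusion is supported by two junctions and skipping by one, so the
    inclusion counts are averaged before forming the ratio.
    Returns None when no junction reads are available.
    """
    inclusion = (incl_upstream + incl_downstream) / 2
    total = inclusion + skipping
    if total == 0:
        return None
    return inclusion / total

# Hypothetical counts for one cassette exon.
print(inclusion_ratio(30, 30, 10))  # 0.75
```

A point estimate like this is unstable at low coverage, which is one motivation for treating each exon as an inference problem as the abstract describes.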
8.  Directional DNA Methylation Changes and Complex Intermediate States Accompany Lineage Specificity in the Adult Hematopoietic Compartment 
Molecular cell  2011;44(1):17-28.
DNA methylation has been implicated as an epigenetic component of mechanisms that stabilize cell-fate decisions. Here, we have characterized the methylomes of human female hematopoietic stem/progenitor cells (HSPCs) and mature cells from the myeloid and lymphoid lineages. Hypomethylated regions (HMRs) associated with lineage-specific genes were often methylated in the opposing lineage. In HSPCs, these sites tended to show intermediate, complex patterns that resolve to uniformity upon differentiation, by increased or decreased methylation. Promoter HMRs shared across diverse cell types typically display a constitutive core that expands and contracts in a lineage-specific manner to fine-tune the expression of associated genes. Many newly identified intergenic HMRs, both constitutive and lineage specific, were enriched for factor binding sites with an implied role in genome organization and regulation of gene expression, respectively. Overall, our studies represent an important reference data set and provide insights into directional changes in DNA methylation as cells adopt terminal fates.
PMCID: PMC3412369  PMID: 21924933
9.  Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates 
Cell  2011;146(6):1029-1041.
During germ cell and preimplantation development, mammalian cells undergo nearly complete reprogramming of DNA methylation patterns. We profiled the methylomes of human and chimp sperm as a basis for comparison to methylation patterns of ES cells. While the majority of promoters escape methylation in both ES cells and sperm, the corresponding hypomethylated regions show substantial structural differences. Repeat elements are heavily methylated in both germ and somatic cells; however, retrotransposons from several subfamilies evade methylation more effectively during male germ cell development, while other subfamilies show the opposite trend. Comparing methylomes of human and chimp sperm revealed a subset of differentially methylated promoters and strikingly divergent methylation in retrotransposon subfamilies, with an evolutionary impact that is apparent in the underlying genomic sequence. Thus, the features that determine DNA methylation patterns differ between male germ cells and somatic cells, and elements of these features have diverged between humans and chimpanzees.
PMCID: PMC3205962  PMID: 21925323
10.  Establishing the baseline level of repetitive element expression in the human cortex 
BMC Genomics  2011;12:495.
Although nearly half of the human genome is comprised of repetitive sequences, the expression profile of these elements remains largely uncharacterized. Recently developed high throughput sequencing technologies provide us with a powerful new set of tools to study repeat elements. Hence, we performed whole transcriptome sequencing to investigate the expression of repetitive elements in human frontal cortex using postmortem tissue obtained from the Stanley Medical Research Institute.
We found that a significant number of reads from the human frontal cortex originate from repeat elements. We also noticed that Alu elements were expressed at levels higher than expected by random or background transcription. In contrast, L1 elements were expressed at lower than expected levels.
Repetitive elements are expressed abundantly in the human brain. This expression pattern appears to be element specific and cannot be explained by random or background transcription. These results demonstrate that our knowledge about repetitive elements is far from complete. Further characterization is required to determine the mechanism, the control, and the effects of repeat element expression.
PMCID: PMC3207997  PMID: 21985647
11.  A comparative analysis of exome capture 
Genome Biology  2011;12(9):R97.
Human exome resequencing using commercial target capture kits has been and is being used for sequencing large numbers of individuals to search for variants associated with various human diseases. We rigorously evaluated the capabilities of two solution exome capture kits. These analyses help clarify the strengths and limitations of those data as well as systematically identify variables that should be considered in the use of those data.
Each exome kit performed well at capturing the targets it was designed to capture, which mainly correspond to the consensus coding sequences (CCDS) annotations of the human genome. In addition, based on their respective targets, each capture kit coupled with high coverage Illumina sequencing produced highly accurate nucleotide calls. However, other databases, such as the Reference Sequence collection (RefSeq), define the exome more broadly, and so not surprisingly, the exome kits did not capture these additional regions.
Commercial exome capture kits provide a very efficient way to sequence select areas of the genome at very high accuracy. Here we provide the data to help guide critical analyses of sequencing data derived from these products.
PMCID: PMC3308060  PMID: 21958622
12.  Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing 
Nature protocols  2009;4(6):960-974.
Complementary techniques that deepen information content and minimize reagent costs are required to realize the full potential of massively parallel sequencing. Here, we describe a resequencing approach that directs focus to genomic regions of high interest by combining hybridization-based purification of multi-megabase regions with sequencing on the Illumina Genome Analyzer (GA). The capture matrix is created by a microarray on which probes can be programmed as desired to target any non-repeat portion of the genome, while the method requires only a basic familiarity with microarray hybridization. We present a detailed protocol suitable for 1–2 µg of input genomic DNA and highlight key design tips in which high specificity (>65% of reads stem from enriched exons) and high sensitivity (98% targeted base pair coverage) can be achieved. We have successfully applied this to the enrichment of coding regions, in both human and mouse, ranging from 0.5 to 4 Mb in length. From genomic DNA library production to base-called sequences, this procedure takes approximately 9–10 d inclusive of array captures and one Illumina flow cell run.
PMCID: PMC2990409  PMID: 19478811
13.  Alta-Cyclic: a self-optimizing base caller for next-generation sequencing 
Nature methods  2008;5(8):679-682.
Next-generation sequencing is limited to short read lengths and by high error rates. We systematically analyzed sources of noise in the Illumina Genome Analyzer that contribute to these high error rates and developed a base caller, Alta-Cyclic, that uses machine learning to compensate for noise factors. Alta-Cyclic substantially improved the number of accurate reads for sequencing runs up to 78 bases and reduced systematic biases, facilitating confident identification of sequence variants.
PMCID: PMC2978646  PMID: 18604217
14.  Specialized piRNA Pathways Act in Germline and Somatic Tissues of the Drosophila Ovary 
Cell  2009;137(3):522-535.
In Drosophila gonads, Piwi proteins and associated piRNAs collaborate with additional factors to form a small RNA-based immune system that silences mobile elements. Here, we analyzed nine Drosophila piRNA pathway mutants for their impacts on both small RNA populations and the subcellular localization patterns of Piwi proteins. We find that distinct piRNA pathways with differing components function in ovarian germ and somatic cells. In the soma, Piwi acts singularly with the conserved flamenco piRNA cluster to enforce silencing of retroviral elements that may propagate by infecting neighboring germ cells. In the germline, silencing programs encoded within piRNA clusters are optimized via a slicer-dependent amplification loop to suppress a broad spectrum of elements. The classes of transposons targeted by germline and somatic piRNA clusters, though not the precise elements, are conserved among Drosophilids, demonstrating that the architecture of piRNA clusters has coevolved with the transposons that they are tasked to control.
PMCID: PMC2882632  PMID: 19395010
15.  Epigenetic Natural Variation in Arabidopsis thaliana 
PLoS Biology  2007;5(7):e174.
Cytosine methylation of repetitive sequences is widespread in plant genomes, occurring in both symmetric (CpG and CpNpG) as well as asymmetric sequence contexts. We used the methylation-dependent restriction enzyme McrBC to profile methylated DNA using tiling microarrays of Arabidopsis Chromosome 4 in two distinct ecotypes, Columbia and Landsberg erecta. We also used comparative genome hybridization to profile copy number polymorphisms. Repeated sequences and transposable elements (TEs), especially long terminal repeat retrotransposons, are densely methylated, but one third of genes also have low but detectable methylation in their transcribed regions. While TEs are almost always methylated, genic methylation is highly polymorphic, with half of all methylated genes being methylated in only one of the two ecotypes. A survey of loci in 96 Arabidopsis accessions revealed a similar degree of methylation polymorphism. Within-gene methylation is heritable, but is lost at a high frequency in segregating F2 families. Promoter methylation is rare, and gene expression is not generally affected by differences in DNA methylation. Small interfering RNA are preferentially associated with methylated TEs, but not with methylated genes, indicating that most genic methylation is not guided by small interfering RNA. This may account for the instability of gene methylation, if occasional failure of maintenance methylation cannot be restored by other means.
Author Summary
In plants and animals, many DNA sequences are modified by the addition of methyl groups, but the principles governing methylation patterns are not well understood. In Arabidopsis, we show that repetitive sequences, derived from mobile (transposable) elements, are densely methylated throughout their length, while about one third of all protein-coding genes are internally methylated. Methylated transposons are silent, homologous to small interfering RNA, and coated with histone H3 dimethylated on lysine-9. In contrast, methylated coding-sequence genes are highly expressed, do not have corresponding small RNAs, and are coated with histone H3 dimethylated on lysine-4. Comparing two different ecotypes of Arabidopsis, we find that transposons are twice as likely as genes to have suffered insertion and deletion, although gene deletion is surprisingly prevalent. While the pattern of transposon methylation is conserved between ecotypes, protein-coding gene methylation is polymorphic so that only half of all gene methylation on any one chromosome is shared between natural accessions collected from around the world.
Two ecotypes of Arabidopsis show different patterns of DNA methylation, which is heritable. Interestingly, differences in DNA methylation are not reflected in differences in gene expression.
PMCID: PMC1892575  PMID: 17579518
16.  On the importance of being finished 
Genome Biology  2002;3(10):comment2010.1-comment2010.4.
The publication of an increasing number of draft genome sequences presents problems that will only be resolved by improved search tools and by complete finishing of the sequences - and their deposition in publicly accessible databases.
PMCID: PMC244905  PMID: 12372139
17.  De Novo Gene Disruptions in Children on the Autistic Spectrum 
Neuron  2012;74(2):285-299.
Exome sequencing of 343 families, each with a single child on the autism spectrum and at least one unaffected sibling, reveals de novo small indels and point substitutions, which come mostly from the paternal line in an age-dependent manner. We do not see significantly greater numbers of de novo missense mutations in affected versus unaffected children, but gene-disrupting mutations (nonsense, splice site, and frame shifts) are twice as frequent, 59 to 28. Based on this differential and the number of recurrent and total targets of gene disruption found in our and similar studies, we estimate between 350 and 400 autism susceptibility genes. Many of the disrupted genes in these studies are associated with the fragile X protein, FMRP, reinforcing links between autism and synaptic plasticity. We find FMRP-associated genes are under greater purifying selection than the remainder of genes and suggest they are especially dosage-sensitive targets of cognitive disorders.
PMCID: PMC3619976  PMID: 22542183
18.  A Functional Phylogenomic View of the Seed Plants 
PLoS Genetics  2011;7(12):e1002411.
A novel result of the current research is the development and implementation of a unique functional phylogenomic approach that explores the genomic origins of seed plant diversification. We first use 22,833 sets of orthologs from the nuclear genomes of 101 genera across land plants to reconstruct their phylogenetic relationships. One of the more salient results is the resolution of some enigmatic relationships in seed plant phylogeny, such as the placement of Gnetales as sister to the rest of the gymnosperms. In using this novel phylogenomic approach, we were also able to identify overrepresented functional gene ontology categories in genes that provide positive branch support for major nodes prompting new hypotheses for genes associated with the diversification of angiosperms. For example, RNA interference (RNAi) has played a significant role in the divergence of monocots from other angiosperms, which has experimental support in Arabidopsis and rice. This analysis also implied that the second largest subunit of RNA polymerase IV and V (NRPD2) played a prominent role in the divergence of gymnosperms. This hypothesis is supported by the lack of 24nt siRNA in conifers, the maternal control of small RNA in the seeds of flowering plants, and the emergence of double fertilization in angiosperms. Our approach takes advantage of genomic data to define orthologs, reconstruct relationships, and narrow down candidate genes involved in plant evolution within a phylogenomic view of species' diversification.
Author Summary
Understanding the genetic and genomic basis of plant diversification has been a major goal of evolutionary biologists since Darwin first pondered his “abominable mystery,” the rapid diversification of the angiosperms in the fossil record. We develop and deploy a functional phylogenomic approach that helps identify genes and biological processes putatively involved in species diversification. We assembled a matrix of 22,833 orthologs from 150 species to reconstruct seed plant phylogenetic relationships and to identify gene sets with a unique evolutionary signal. Our analysis of overrepresented biological processes in these sets narrowed down possible genetic mechanisms underlying plant adaptation and diversification. The phylogenetic relationships we uncovered support the hypothesis that gnetophytes are closely related to the rest of the gymnosperms at the base of the living seed plants. We also found that genes involved in post-transcriptional silencing via RNA interference (RNAi)—increasingly important in understanding plant evolution—are significantly represented early in angiosperm and gymnosperm divergence, with an apparent loss of specific classes of small interfering RNAs (siRNA) in gymnosperms. Our functional phylogenomic approach can be applied to any taxa with available sequences to enhance our knowledge of the evolutionary processes underlying biodiversity in general.
PMCID: PMC3240601  PMID: 22194700
19.  Sorghum Genome Sequencing by Methylation Filtration 
PLoS Biology  2005;3(1):e13.
Sorghum bicolor is a close relative of maize and is a staple crop in Africa and much of the developing world because of its superior tolerance of arid growth conditions. We have generated sequence from the hypomethylated portion of the sorghum genome by applying methylation filtration (MF) technology. The evidence suggests that 96% of the genes have been sequence tagged, with an average coverage of 65% across their length. Remarkably, this level of gene discovery was accomplished after generating a raw coverage of less than 300 megabases of the 735-megabase genome. MF preferentially captures exons and introns, promoters, microRNAs, and simple sequence repeats, and minimizes interspersed repeats, thus providing a robust view of the functional parts of the genome. The sorghum MF sequence set is beneficial to research on sorghum and is also a powerful resource for comparative genomics among the grasses and across the entire plant kingdom. Thousands of hypothetical gene predictions in rice and Arabidopsis are supported by the sorghum dataset, and genomic similarities highlight evolutionarily conserved regions that will lead to a better understanding of rice and Arabidopsis.
Methylation filtration makes practical the sequencing of large genomes, such as those found in sorghum, by preferentially capturing functionally relevant sequences
PMCID: PMC539327  PMID: 15660154
