
Results 1-25 (47)

author:("Tang, haifu")
1.  Stranded Whole Transcriptome RNA-Seq for All RNA Types 
Stranded whole transcriptome RNA-Seq described in this unit captures quantitative expression data for all types of RNA including, but not limited to miRNA (microRNA), piRNA (Piwi-interacting RNA), snoRNA (small nucleolar RNA), lincRNA (large non-coding intergenic RNA), SRP RNA (signal recognition particle RNA), tRNA (transfer RNA), mtRNA (mitochondrial RNA) and mRNA (messenger RNA). The size and nature of these types of RNA are irrelevant to the approach described here. Barcoded libraries for multiplexing on the Illumina platform are generated with this approach but it can be applied to other platforms with a few modifications.
PMCID: PMC4337225  PMID: 25599667
RNA-Seq; transcriptome; gene expression; Duplex-specific nuclease
2.  Automated Annotation and Quantitation of Glycan by LC-ESI-MS analysis using MultiGlycan-ESI Computational Tool 
LC-MS is currently considered to be a conventional glycomics analysis strategy due to its high sensitivity and ability to handle complex biological samples. Interpretation of LC-MS data is a major bottleneck in high-throughput LC-MS-based glycomics analysis. The complexity of LC-MS data associated with biological samples prompts the need to develop computational tools capable of facilitating automated data annotation and quantitation.
An LC-MS based automated data annotation and quantitation software, MultiGlycan-ESI, was developed and utilized for glycan quantitation. Data generated by the software from LC-MS analysis of permethylated N-glycans derived from fetuin were initially validated by manual integration to assess the performance of the software. The performance of MultiGlycan-ESI was then assessed for the quantitation of permethylated fetuin N-glycans analyzed at different concentrations or spiked with permethylated N-glycans derived from human blood serum.
The relative abundance differences between data generated by the software and those generated by manual integration were less than 5%, indicating the reliability of MultiGlycan-ESI in quantitation of permethylated glycans analyzed by LC-MS. Automated quantitation resulted in a linear relationship for all six N-glycans derived from 50 ng to 400 ng fetuin with correlation coefficients (R²) greater than 0.93. Spiking of permethylated fetuin N-glycans at different concentrations in permethylated N-glycan samples derived from 0.02 µL of human blood serum (HBS) also exhibited linear agreement with R² values greater than 0.9.
With a variety of options, including mass accuracy, merged adducts, and filtering criteria, MultiGlycan-ESI allows automated annotation and quantitation of LC-ESI-MS N-glycan data. The software is reliable for automated glycan quantitation, thus facilitating rapid and reliable high-throughput glycomics studies.
PMCID: PMC4516131  PMID: 25462374
MultiGlycan-ESI; Glycans; Glycoproteins; Permethylation; LC-MS; Quantitation
3.  Choosing blindly but wisely: differentially private solicitation of DNA datasets for disease marker discovery 
Objective To propose a new approach to privacy preserving data selection, which helps the data users access human genomic datasets efficiently without undermining patients’ privacy.
Methods Our idea is to let each data owner publish a set of differentially-private pilot data, on which a data user can test-run arbitrary association-test algorithms, including those not known to the data owner a priori. We developed a suite of new techniques, including a pilot-data generation approach that leverages the linkage disequilibrium in the human genome to preserve both the utility of the data and the privacy of the patients, and a utility evaluation method that helps the user assess the value of the real data from its pilot version with high confidence.
Results We evaluated our approach on real human genomic data using four popular association tests. Our study shows that the proposed approach can help data users make the right choices in most cases.
Conclusions Even though the pilot data cannot be directly used for scientific discovery, it provides a useful indication of which datasets are more likely to be useful to data users, who can therefore approach the appropriate data owners to gain access to the data.
PMCID: PMC4433380  PMID: 25352565
Privacy-preserving techniques; Genome-wide association studies; Differential Privacy; Test statistics; Single nucleotide polymorphisms (SNPs); Haplotype blocks
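For entry 3, the abstract does not spell out how the differentially private pilot data are generated, so the block below is not the authors' method. It is a minimal sketch of the standard Laplace mechanism, a basic building block of differential privacy, applied to a vector of hypothetical allele counts. The function name `private_counts`, the `sensitivity` parameter (assumed here to be the L1 sensitivity of the whole count vector) and the example numbers are all illustrative assumptions.

```python
import random


def private_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace(sensitivity / epsilon) noise to each count.

    Assumption: `sensitivity` is the L1 sensitivity of the whole count
    vector, i.e. the most that one individual can change the counts in
    total. Under that assumption the release is epsilon-differentially
    private. This is the textbook Laplace mechanism, not the pilot-data
    generation scheme described in the abstract.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    rate = epsilon / sensitivity  # rate = 1 / scale of the Laplace noise
    # The difference of two i.i.d. exponential variables is Laplace-distributed.
    return [c + rng.expovariate(rate) - rng.expovariate(rate) for c in counts]


if __name__ == "__main__":
    true_counts = [120, 87, 45]  # hypothetical minor-allele counts
    print(private_counts(true_counts, epsilon=0.5, seed=1))
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is the utility trade-off that the pilot-data idea in the abstract tries to manage.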
4.  Glycoproteomics: Identifying the Glycosylation of Prostate Specific Antigen at Normal and High Isoelectric Points by LC–MS/MS 
Journal of Proteome Research  2014;13(12):5570-5580.
Prostate specific antigen (PSA) is currently used as a biomarker to diagnose prostate cancer. PSA testing has been widely used to detect and screen for prostate cancer. However, in the diagnostic gray zone, the PSA test does not clearly distinguish between benign prostate hypertrophy and prostate cancer due to their overlap. To develop more specific and sensitive candidate biomarkers for prostate cancer, an in-depth understanding of the biochemical characteristics of PSA (such as glycosylation) is needed. PSA has a single glycosylation site at Asn69, with glycans constituting approximately 8% of the protein by weight. Here, we report the comprehensive identification and quantitation of N-glycans from two PSA isoforms using LC–MS/MS. There were 56 N-glycans associated with PSA, whereas 57 N-glycans were observed in the case of the PSA-high isoelectric point (pI) isoform (PSAH). Three sulfated/phosphorylated glycopeptides were detected, the identification of which was supported by tandem MS data. One of these sulfated/phosphorylated N-glycans, HexNAc5Hex4dHex1s/p1, was identified in both PSA and PSAH at relative intensities of 0.52 and 0.28%, respectively. Quantitatively, the variations were monitored between these two isoforms. Because we were one of the laboratories participating in the 2012 ABRF Glycoprotein Research Group (gPRG) study, the results of that study were compared with those presented here. Our qualitative and quantitative results summarized here were comparable to those that were summarized in the interlaboratory study.
PMCID: PMC4261947  PMID: 25327667
Prostate specific antigen; PSA; N-linked glycosylation; glycopeptide; glycoproteomics; LC−MS/MS
5.  DNA sequence templates adjacent nucleosome and ORC sites at gene amplification origins in Drosophila 
Nucleic Acids Research  2015;43(18):8746-8761.
Eukaryotic origins of DNA replication are bound by the origin recognition complex (ORC), which scaffolds assembly of a pre-replicative complex (pre-RC) that is then activated to initiate replication. Both pre-RC assembly and activation are strongly influenced by developmental changes to the epigenome, but molecular mechanisms remain incompletely defined. We have been examining the activation of origins responsible for developmental gene amplification in Drosophila. At a specific time in oogenesis, somatic follicle cells transition from genomic replication to a locus-specific replication from six amplicon origins. Previous evidence indicated that these amplicon origins are activated by nucleosome acetylation, but how this affects origin chromatin is unknown. Here, we examine nucleosome position in follicle cells using micrococcal nuclease digestion with Illumina sequencing. The results indicate that ORC binding sites and other essential origin sequences are nucleosome-depleted regions (NDRs). Nucleosome position at the amplicons was highly similar among developmental stages during which ORC is or is not bound, indicating that being an NDR is not sufficient to specify ORC binding. Importantly, the data suggest that nucleosomes and ORC have opposite preferences for DNA sequence and structure. We propose that nucleosome hyperacetylation promotes pre-RC assembly onto adjacent DNA sequences that are disfavored by nucleosomes but favored by ORC.
PMCID: PMC4605296  PMID: 26227968
6.  Identification of Pol IV and RDR2-dependent precursors of 24 nt siRNAs guiding de novo DNA methylation in Arabidopsis 
eLife  2015;4:e09591.
In Arabidopsis thaliana, abundant 24 nucleotide small interfering RNAs (24 nt siRNA) guide the cytosine methylation and silencing of transposons and a subset of genes. 24 nt siRNA biogenesis requires nuclear RNA polymerase IV (Pol IV), RNA-dependent RNA polymerase 2 (RDR2) and DICER-like 3 (DCL3). However, siRNA precursors are mostly undefined. We identified Pol IV and RDR2-dependent RNAs (P4R2 RNAs) that accumulate in dcl3 mutants and are diced into 24 nt RNAs by DCL3 in vitro. P4R2 RNAs are mostly 26-45 nt and initiate with a purine adjacent to a pyrimidine, characteristics shared by Pol IV transcripts generated in vitro. RDR2 terminal transferase activity, also demonstrated in vitro, may account for occasional non-templated nucleotides at P4R2 RNA 3’ termini. The 24 nt siRNAs primarily correspond to the 5’ or 3’ ends of P4R2 RNAs, suggesting a model whereby siRNAs are generated from either end of P4R2 duplexes by single dicing events.
eLife digest
Genes contain instructions for processes in cells and therefore their activities must be carefully controlled. The addition of small chemical tags called methyl groups to DNA is one of the many ways by which cells can influence gene activity. These methyl groups can silence genes by altering the DNA so that it is more tightly packed within the nucleus of the cell. Virus genes and mobile sections of DNA called transposable elements (sometimes known as jumping genes) are also silenced by DNA methylation to keep them from doing harm.
In plants, methyl groups can be attached to DNA by proteins that are guided to the DNA by molecules called short interfering ribonucleic acids (or siRNAs for short). Each siRNA is made of a chain of 24 building blocks called nucleotides and is able to bind to matching RNA molecules that are attached to the target DNA. The siRNAs are made from longer RNA molecules in a process that involves trimming by an enzyme called DCL3. However, it is not clear how long these “precursor” molecules are before DCL3 cuts them down to size.
Here, Blevins, Podicheti et al. studied how siRNAs are made in a plant called Arabidopsis thaliana. The experiments show that RNAs containing around 26-45 nucleotides accumulate in cells that lack DCL3 and these cells are unable to make 24 nucleotide long siRNAs. Furthermore, the purified DCL3 enzyme can cut these precursor RNAs to make the siRNAs. Because the precursors are relatively short, the experiments suggest that DCL3 only cuts each precursor RNA once when making siRNAs.
Blevins, Podicheti et al. also show that the siRNA precursors are made by a partnership of two RNA synthesizing enzymes. Therefore, a challenge for the future will be to understand exactly how they work together.
PMCID: PMC4716838  PMID: 26430765
RNA silencing; epigenetics; RNA-directed DNA methylation; noncoding RNA; RNA interference; transcription; Arabidopsis
7.  A two-step process for epigenetic inheritance in Arabidopsis 
Molecular cell  2014;54(1):30-42.
In Arabidopsis, multisubunit RNA polymerases IV and V orchestrate RNA-directed DNA methylation (RdDM) and transcriptional silencing, but what identifies the loci to be silenced is unclear. We show that heritable silent locus identity at a specific subset of RdDM targets requires HISTONE DEACETYLASE 6 (HDA6) acting upstream of Pol IV recruitment and siRNA biogenesis. At these loci, epigenetic memory conferring silent locus identity is erased in hda6 mutants such that restoration of HDA6 activity cannot restore siRNA biogenesis or silencing. Silent locus identity is similarly lost in mutants for the cytosine maintenance methyltransferase, MET1. By contrast, pol IV or pol V mutants disrupt silencing without erasing silent locus identity, allowing restoration of Pol IV or Pol V function to restore silencing. Collectively, these observations indicate that silent locus specification and silencing are separable steps that together account for epigenetic inheritance of the silenced state.
PMCID: PMC3988221  PMID: 24657166
8.  Secure Genomic Computation through Site-Wise Encryption 
Commercial clouds provide on-demand IT services for big-data analysis, which have become an attractive option for users who have no access to comparable infrastructure. However, utilizing these services for human genome analysis is highly risky, as human genomic data contains identifiable information of human individuals and their disease susceptibility. Therefore, currently, no computation on personal human genomic data is conducted on public clouds. To address this issue, here we present a site-wise encryption approach to encrypt whole human genome sequences, which can be subject to secure searching of genomic signatures on public clouds. We implemented this method within the Hadoop framework, and tested it on the case of searching disease markers retrieved from the ClinVar database against patients’ genomic sequences. The secure search runs only one order of magnitude slower than the simple search without encryption, indicating our method is ready to be used for secure genomic computation on public clouds.
PMCID: PMC4525260  PMID: 26306278
9.  A New Method for Stranded Whole Transcriptome RNA-seq 
Methods (San Diego, Calif.)  2013;63(2):126-134.
This report describes an improved protocol to generate stranded, barcoded RNA-seq libraries to capture the whole transcriptome. By optimizing the use of duplex specific nuclease (DSN) to remove ribosomal RNA reads from stranded barcoded libraries, we demonstrate improved efficiency of multiplexed next generation sequencing (NGS). This approach detects expression profiles of all RNA types, including miRNA (microRNA), piRNA (Piwi-interacting RNA), snoRNA (small nucleolar RNA), lincRNA (long non-coding RNA), mtRNA (mitochondrial RNA) and mRNA (messenger RNA) without the use of gel electrophoresis. The improved protocol generates high quality data that can be used to identify differential expression in known and novel coding and non-coding transcripts, splice variants, mitochondrial genes and SNPs (single nucleotide polymorphisms).
PMCID: PMC3739992  PMID: 23557989
RNA-seq; transcriptome; duplex-specific nuclease; gene expression
10.  Gene finding in metatranscriptomic sequences 
BMC Bioinformatics  2014;15(Suppl 9):S8.
Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics.
In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TransGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript.
We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TransGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply that, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture of the genes (and their functions) in a microbial community.
TransGeneScan is available as open-source software on SourceForge at
PMCID: PMC4168707  PMID: 25253067
Metatranscriptomics; Gene finding; Hidden Markov Model; Operons; Antisense RNA (asRNA)
11.  Automated annotation and quantification of glycans using liquid chromatography–mass spectrometry 
Bioinformatics  2013;29(13):1706-1707.
Summary: As a common post-translational modification, protein glycosylation plays an important role in many biological processes, and it is known to be associated with human diseases. Mass spectrometry (MS)-based glycomic profiling techniques have been developed to measure the abundances of glycans in complex biological samples and applied to the discovery of putative glycan biomarkers. To automate the annotation of glycomic profiles in the liquid chromatography-MS (LC–MS) data, we present here a user-friendly software tool, MultiGlycan, implemented in C# on Windows systems. We tested MultiGlycan by using several glycomic profiling datasets acquired using LC–MS under different preparations and show that MultiGlycan executes fast and generates robust and reliable results.
Availability: MultiGlycan can be freely downloaded at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4542666  PMID: 23610369
12.  Protein identification problem from a Bayesian point of view 
Statistics and its interface  2012;5(1):21-37.
We present a generic Bayesian framework for the peptide and protein identification in proteomics, and provide a unified interpretation for the database searching and the de novo peptide sequencing approaches that are used in peptide identification. We describe several probabilistic graphical models and a variety of prior distributions that can be incorporated into the Bayesian framework to model different types of prior information, such as the known protein sequences, the known protein abundances, the peptide precursor masses, the estimated peptide retention time and the peptide detectabilities. Various applications of the Bayesian framework are discussed theoretically, including its application to the identification of peptides containing mutations and post-translational modifications.
PMCID: PMC3992622  PMID: 24761189
Shotgun proteomics; Protein identification; Mass spectrometry; Bayesian methods
13.  Extending the coverage of spectral libraries: a neighbor-based approach to predicting intensities of peptide fragmentation spectra 
Proteomics  2013;13(5):756-765.
Searching spectral libraries in tandem mass spectrometry (MS/MS) is an important new approach to improving the quality of peptide and protein identification. The idea relies on the observation that ion intensities in an MS/MS spectrum of a given peptide are generally reproducible across experiments, and thus, matching between spectra from an experiment and the spectra of previously identified peptides stored in a spectral library can lead to better peptide identification compared to the traditional database search. However, the use of libraries is greatly limited by their coverage of peptide sequences: even for well-studied organisms a large fraction of peptides have not been previously identified. To address this issue, we propose to expand spectral libraries by predicting the MS/MS spectra of peptides based on the spectra of peptides with similar sequences. We first demonstrate that the intensity patterns of dominant fragment ions between similar peptides tend to be similar. In accordance with this observation, we develop a neighbor-based approach which first selects peptides that are likely to have spectra similar to the target peptide and then combines their spectra using a weighted K-nearest neighbor method to accurately predict fragment ion intensities corresponding to the target peptide. This approach has the potential to predict spectra for every peptide in the proteome. When rigorous quality criteria are applied, we estimate that the method increases the coverage of spectral libraries available from the National Institute of Standards and Technology by 20–60%, although the values vary with peptide length and charge state. We find that the overall best search performance is achieved when spectral libraries are supplemented by the high quality predicted spectra.
PMCID: PMC3733334  PMID: 23303707
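For entry 13, the neighbor-based idea (pick library peptides similar to the target, then combine their spectra with similarity weights) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the similarity measure or how spectra are represented, so `similarity` (fraction of identical positions between equal-length peptides), the fixed-length intensity vectors, and the toy library are all hypothetical.

```python
def similarity(a, b):
    """Hypothetical similarity: fraction of identical positions.

    Peptides of different length get similarity 0 so that their
    intensity vectors (assumed to be aligned by fragment ion) are
    never mixed.
    """
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_intensities(target, library, k=3):
    """Weighted K-nearest-neighbor prediction of fragment ion intensities.

    `library` maps peptide sequence -> list of intensities, one per
    fragment ion. Returns the similarity-weighted average over the k
    most similar library peptides, or None if no neighbor is usable.
    """
    scored = sorted(
        ((similarity(target, pep), vec) for pep, vec in library.items() if pep != target),
        key=lambda item: item[0],
        reverse=True,
    )[:k]
    scored = [(w, vec) for w, vec in scored if w > 0]
    if not scored:
        return None
    total = sum(w for w, _ in scored)
    n_ions = len(scored[0][1])
    return [sum(w * vec[i] for w, vec in scored) / total for i in range(n_ions)]


if __name__ == "__main__":
    toy_library = {  # hypothetical normalized intensities
        "PEPTIDEK": [1.0, 0.4, 0.1],
        "PEPTIDER": [0.8, 0.5, 0.2],
        "AAAAAAAA": [0.1, 0.1, 0.9],
    }
    print(predict_intensities("PEPTIDEA", toy_library, k=2))
```

Restricting neighbors to peptides of the same length keeps the intensity vectors comparable; a real implementation would also need quality filters such as those the abstract mentions.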
14.  Software tools for glycan profiling 
Methods in molecular biology (Clifton, N.J.)  2013;951:10.1007/978-1-62703-146-2_18.
PMCID: PMC3861397  PMID: 23296537
15.  A de Bruijn Graph Approach to the Quantification of Closely-Related Genomes in a Microbial Community 
Journal of Computational Biology  2012;19(6):814-825.
The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to differentiate from the inter-species variations during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed.
Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
PMCID: PMC3375647  PMID: 22697249
closely-related genomes; de Bruijn graph; metagenomics; quantification
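For entry 15, the central idea (represent several related genomes as one de Bruijn graph so that shared regions collapse into single nodes) can be illustrated with a toy example. The sketch below is a strong simplification and not the published algorithm: it records which genomes own each k-mer node and then counts only read k-mers that fall on genome-specific nodes. The function names, the value of `k`, and the sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(genomes, k):
    """Build a k-mer de Bruijn-style graph over several genomes.

    Returns (owners, edges): `owners` maps each k-mer node to the set
    of genomes containing it, so a region shared by several genomes is
    stored once; `edges` links consecutive k-mers.
    """
    owners = defaultdict(set)
    edges = set()
    for name, seq in genomes.items():
        ks = kmers(seq, k)
        for km in ks:
            owners[km].add(name)
        edges.update(zip(ks, ks[1:]))
    return owners, edges


def quantify(reads, owners, k):
    """Count read k-mers that hit nodes owned by exactly one genome.

    Shared nodes are skipped, which avoids crediting a read to several
    genomes at once (a simplification of the mapping step).
    """
    counts = defaultdict(int)
    for read in reads:
        for km in kmers(read, k):
            who = owners.get(km)
            if who and len(who) == 1:
                counts[next(iter(who))] += 1
    return dict(counts)


if __name__ == "__main__":
    genomes = {"strainA": "ACGTACGGA", "strainB": "ACGTACGTT"}  # toy sequences
    owners, edges = build_graph(genomes, k=4)
    print(quantify(["TACGGA", "ACGTT"], owners, k=4))
```

In this toy case the four k-mers common to both strains are stored once, and only the strain-specific k-mers contribute to the counts.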
16.  CRISPR-Cas systems target a diverse collection of invasive mobile genetic elements in human microbiomes 
Genome Biology  2013;14(4):R40.
Bacteria and archaea develop immunity against invading genomes by incorporating pieces of the invaders' sequences, called spacers, into a clustered regularly interspaced short palindromic repeats (CRISPR) locus between repeats, forming arrays of repeat-spacer units. When spacers are expressed, they direct CRISPR-associated (Cas) proteins to silence complementary invading DNA. In order to characterize the invaders of human microbiomes, we use spacers from CRISPR arrays that we had previously assembled from shotgun metagenomic datasets, and identify contigs that contain these spacers' targets.
We discover 95,000 contigs that are putative invasive mobile genetic elements, some targeted by hundreds of CRISPR spacers. We find that oral sites in healthy human populations have a much greater variety of mobile genetic elements than stool samples. Mobile genetic elements carry genes encoding diverse functions: only 7% of the mobile genetic elements are similar to known phages or plasmids, although a much greater proportion contain phage- or plasmid-related genes. A small number of contigs share similarity with known integrative and conjugative elements, providing the first examples of CRISPR defenses against this class of element. We provide detailed analyses of a few large mobile genetic elements of various types, and a relative abundance analysis of mobile genetic elements and putative hosts, exploring the dynamic activities of mobile genetic elements in human microbiomes. A joint analysis of mobile genetic elements and CRISPRs shows that protospacer-adjacent motifs drive their interaction network; however, some CRISPR-Cas systems target mobile genetic elements lacking motifs.
We identify a large collection of invasive mobile genetic elements in human microbiomes, an important resource for further study of the interaction between the CRISPR-Cas immune system and invaders.
PMCID: PMC4053933  PMID: 23628424
CRISPR-Cas system; human microbiome; mobile genetic element (MGE)
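For entry 16, the step of using spacers to identify contigs that contain their targets can be sketched as a string search on both strands. The abstract does not state the matching criteria, so exact matching is an assumption made here for brevity (a real analysis would likely use an alignment tool that tolerates mismatches). All identifiers and sequences below are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_targets(spacers, contigs):
    """Find exact spacer matches (protospacers) in contigs on both strands.

    Returns {contig_id: [(spacer_id, position, strand), ...]}; contigs
    without any hit are omitted. Positions are 0-based offsets in the
    contig as given.
    """
    hits = {}
    for cid, contig in contigs.items():
        for sid, spacer in spacers.items():
            for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
                pos = contig.find(query)
                while pos != -1:
                    hits.setdefault(cid, []).append((sid, pos, strand))
                    pos = contig.find(query, pos + 1)
    return hits


if __name__ == "__main__":
    spacers = {"s1": "ACGTTG"}  # toy spacer
    contigs = {"c1": "TTACGTTGAA", "c2": "GGCAACGTCC", "c3": "AAAAAAAA"}
    print(find_targets(spacers, contigs))
```

Contigs that collect hits from many distinct spacers would be the natural candidates for the putative mobile genetic elements described in the abstract.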
17.  N-Glycan Profiling by Microchip Electrophoresis to Differentiate Disease-States Related to Esophageal Adenocarcinoma 
Analytical Chemistry  2012;84(8):3621-3627.
We report analysis of N-glycans derived from disease-free individuals and patients with Barrett's esophagus, high-grade dysplasia, and esophageal adenocarcinoma by microchip electrophoresis with laser-induced fluorescence detection. Serum samples in 10-μL aliquots are enzymatically treated to cleave the N-glycans that are subsequently reacted with 8-aminopyrene-1,3,6-trisulfonic acid to add charge and a fluorescent label. Separations at 1250 V/cm and over 22 cm yielded efficiencies up to 700,000 plates for the N-glycans and analysis times under 100 s. Principal component analysis (PCA) and analysis of variance (ANOVA) tests of the peak areas and migration times are used to evaluate N-glycan profiles from native and desialylated samples and to determine differences among the four sample groups. With microchip electrophoresis, we are able to distinguish the three patient groups from each other and from disease-free individuals.
PMCID: PMC3339272  PMID: 22397697
microfluidics; microchip electrophoresis; N-glycans; glycan profiling; disease-state monitoring; Barrett's esophagus; high-grade dysplasia; esophageal adenocarcinoma
18.  Testosterone Affects Neural Gene Expression Differently in Male and Female Juncos: A Role for Hormones in Mediating Sexual Dimorphism and Conflict 
PLoS ONE  2013;8(4):e61784.
Despite sharing much of their genomes, males and females are often highly dimorphic, reflecting at least in part the resolution of sexual conflict in response to sexually antagonistic selection. Sexual dimorphism arises owing to sex differences in gene expression, and steroid hormones are often invoked as a proximate cause of sexual dimorphism. Experimental elevation of androgens can modify behavior, physiology, and gene expression, but knowledge of the role of hormones remains incomplete, including how the sexes differ in gene expression in response to hormones. We addressed these questions in a bird species with a long history of behavioral endocrinological and ecological study, the dark-eyed junco (Junco hyemalis), using a custom microarray. Focusing on two brain regions involved in sexually dimorphic behavior and regulation of hormone secretion, we identified 651 genes that differed in expression by sex in medial amygdala and 611 in hypothalamus. Additionally, we treated individuals of each sex with testosterone implants and identified many genes that may be related to previously identified phenotypic effects of testosterone treatment. Some of these genes relate to previously identified effects of testosterone-treatment and suggest that the multiple effects of testosterone may be mediated by modifying the expression of a small number of genes. Notably, testosterone-treatment tended to alter expression of different genes in each sex: only 4 of the 527 genes identified as significant in one sex or the other were significantly differentially expressed in both sexes. Hormonally regulated gene expression is a key mechanism underlying sexual dimorphism, and our study identifies specific genes that may mediate some of these processes.
PMCID: PMC3627916  PMID: 23613935
19.  Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences 
PLoS Computational Biology  2013;9(3):e1002981.
Shotgun metagenomics has been applied to the studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used for comparing multiple microbial communities. The utilization of pathway reconstruction, however, can be jeopardized because of imperfect functional annotation of genes, and ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at
Author Summary
We present a probabilistic sampling approach to profiling metabolic reactions in a microbial community from metagenomic shotgun reads, in an attempt to understand the metabolism within a microbial community and compare it across multiple communities. Different from the conventional pathway reconstruction approaches that aim at a definitive set of reactions, our method estimates how likely each annotated reaction is to occur in the metabolism of the microbial community, given the shotgun sequencing data. This probabilistic measure improves our prediction of the actual metabolism in the microbial communities and can be used in the comparative functional analysis of metagenomic data.
PMCID: PMC3605055  PMID: 23555216
20.  On the Mutational Topology of the Bacterial Genome 
G3: Genes|Genomes|Genetics  2013;3(3):399-407.
By sequencing the genomes of 34 mutation accumulation lines of a mismatch-repair defective strain of Escherichia coli that had undergone a total of 12,750 generations, we identified 1625 spontaneous base-pair substitutions spread across the E. coli genome. These mutations are not distributed at random but, instead, fall into a wave-like spatial pattern that is repeated almost exactly in mirror image in the two separately replicated halves of the bacterial chromosome. The pattern is correlated to genomic features, with mutation densities greatest in regions predicted to have high superhelicity. Superimposed upon this pattern are regional hotspots, some of which are located where replication forks may collide or be blocked. These results suggest that, as they traverse the chromosome, the two replication forks encounter parallel structural features that change the fidelity of DNA replication.
PMCID: PMC3583449  PMID: 23450823
mutation rate; evolution; replication fidelity; chromosome structure; DNA polymerase errors
21.  The Ecoresponsive Genome of Daphnia pulex 
Science (New York, N.Y.)  2011;331(6017):555-561.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
PMCID: PMC3529199  PMID: 21292972
23.  Investigation of VUV Photodissociation Propensities Using Peptide Libraries 
PSD does not usually generate a complete series of y-type ions, particularly at high mass, and this is a limitation for de novo sequencing algorithms. It is demonstrated that b2 and b3 ions can be used to help assign high mass xN-2 and xN-3 fragments that are found in vacuum ultraviolet (VUV) photofragmentation experiments. In addition, vN-type ion fragments with side chain loss from the N-terminal residue often enable confirmation of N-terminal amino acids. Libraries containing several thousand peptides were examined using photodissociation in a MALDI-TOF/TOF instrument. 1345 photodissociation spectra with a high S/N ratio were interpreted.
PMCID: PMC3224043  PMID: 22125417
De Novo Sequencing; Photodissociation; Mass Spectrometry; MALDI-TOF/TOF
24.  De novo transcriptome sequencing in a songbird, the dark-eyed junco (Junco hyemalis): genomic tools for an ecological model system 
BMC Genomics  2012;13:305.
Though genomic-level data are becoming widely available, many of the metazoan species sequenced are laboratory systems whose natural history is not well documented. In contrast, the wide array of species with very well-characterized natural history have, until recently, lacked genomics tools. It is now possible to address significant evolutionary genomics questions by applying high-throughput sequencing to discover the majority of genes for ecologically tractable species, and by subsequently developing microarray platforms from which to investigate gene regulatory networks that function in natural systems. We used GS-FLX Titanium Sequencing (Roche/454-Sequencing) of two normalized libraries of pooled RNA samples to characterize a transcriptome of the dark-eyed junco (Junco hyemalis), a North American sparrow that is a classically studied species in the fields of photoperiodism, speciation, and hormone-mediated behavior.
From a broad pool of RNA sampled from tissues throughout the body of a male and a female junco, we sequenced a total of 434 million nucleotides from 1.17 million reads that were assembled de novo into 31,379 putative transcripts representing 22,765 gene sets covering 35.8 million nucleotides with 12-fold average depth of coverage. Annotation of roughly half of the putative genes was accomplished using sequence similarity, and expression was confirmed for the majority with a preliminary microarray analysis. Of 716 core bilaterian genes, 646 (90%) were recovered within our characterized gene set. Gene Ontology, orthoDB orthology groups, and KEGG Pathway annotation provide further functional information about the sequences, and 25,781 potential SNPs were identified.
The extensive sequence information returned by this effort adds to the growing store of genomic data on diverse species. The extent of coverage and annotation achieved, together with the confirmation of expression, shows that transcriptome sequencing provides useful information for ecological model systems that have historically lacked genomic tools. The junco-specific microarray developed here is allowing investigations of gene expression responses to environmental and hormonal manipulations – extending the historic work on natural history and hormone-mediated phenotypes in this system.
PMCID: PMC3476391  PMID: 22776250
Transcriptome; Aves; pyrosequencing; microarray; Junco; 454 titanium cDNA sequencing; single nucleotide polymorphism.
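The summary statistics quoted in the junco abstract follow from simple ratios. The short check below uses only numbers stated in the abstract; the variable names are illustrative, and depth is computed as total sequenced nucleotides divided by assembled nucleotides, which is an assumption about how the authors defined it.

```python
# Figures taken from the abstract.
total_nt = 434e6        # nucleotides sequenced
n_reads = 1.17e6        # number of reads
assembled_nt = 35.8e6   # nucleotides covered by the assembled gene sets
core_total = 716        # core bilaterian genes examined
core_found = 646        # core genes recovered

mean_read_length = total_nt / n_reads      # average read length in nt
average_depth = total_nt / assembled_nt    # assumed definition of depth
core_recovery = core_found / core_total    # fraction of core genes recovered

print(f"mean read length: {mean_read_length:.0f} nt")
print(f"average depth:    {average_depth:.1f}-fold")
print(f"core recovery:    {core_recovery:.1%}")
```

The ratios come out at about 371 nt per read, about 12-fold depth, and about 90% core gene recovery, consistent with the values reported in the abstract.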
25.  Diverse CRISPRs Evolving in Human Microbiomes 
PLoS Genetics  2012;8(6):e1002441.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. 
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Author Summary
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference to future invading foreign genomes. Powered by an effective computational approach (the targeted assembly approach for CRISPR), our analysis of CRISPR arrays in the HMP datasets provides the very first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time, indicates the impact of subtle niche differences on the evolution of CRISPR defenses and indicates the key role of bacteriophage (and plasmids) in shaping human microbial communities.
PMCID: PMC3374615  PMID: 22719260
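The read-recruitment and spacer-extraction idea in the CRISPR abstract can be illustrated with a toy sketch. This is not the authors' pipeline: it uses exact string matching, skips the assembly step entirely, and works on made-up sequences. The repeat `GTTTTAGA`, the reads, and all function names are hypothetical.

```python
def recruit_reads(reads, repeat):
    """Keep only reads that contain at least one exact copy of the repeat."""
    return [read for read in reads if repeat in read]


def extract_spacers(sequence, repeat):
    """Return segments flanked on both sides by the repeat.

    The first and last pieces of the split are flanking sequence,
    not spacers, so they are discarded.
    """
    parts = sequence.split(repeat)
    return [part for part in parts[1:-1] if part]


def unique_spacers(reads, repeat):
    """Collect spacers from recruited reads, keeping first-seen order."""
    seen = []
    for read in recruit_reads(reads, repeat):
        for spacer in extract_spacers(read, repeat):
            if spacer not in seen:
                seen.append(spacer)
    return seen


# Hypothetical toy data: a short "repeat" and three reads.
REPEAT = "GTTTTAGA"
reads = [
    "AAC" + REPEAT + "ACGTACGT" + REPEAT + "TTGCA" + REPEAT + "GG",
    "CC" + REPEAT + "TTGCA" + REPEAT + "A",
    "ACGTACGTACGT",  # contains no repeat, so it is not recruited
]

spacers = unique_spacers(reads, REPEAT)
print(spacers)
```

A realistic version would tolerate mismatches in the repeat, consider both strands, and assemble the recruited reads into loci before extracting spacers, as the abstract describes.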
