1.  Using the PfEMP1 Head Structure Binding Motif to Deal a Blow at Severe Malaria 
PLoS ONE  2014;9(2):e88420.
Plasmodium falciparum (Pf) malaria causes 200 million cases worldwide, 8 million being severe and complicated leading to ∼1 million deaths and ∼100,000 abortions annually. Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1) has been implicated in cytoadherence and infected erythrocyte rosette formation, associated with cerebral malaria; chondroitin sulphate-A attachment and infected erythrocyte sequestration related to pregnancy-associated malaria and other severe forms of disease. An endothelial cell high activity binding peptide is described in several of this ∼300 kDa hypervariable protein’s domains displaying a conserved motif (GACxPxRRxxLC); it established H-bonds with other binding peptides to mediate red blood cell group A and chondroitin sulphate attachment. This motif (when properly modified) induced PfEMP1-specific strain-transcending, fully-protective immunity for the first time in experimental challenge in Aotus monkeys, opening the way forward for a long sought-after vaccine against severe malaria.
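The conserved motif quoted above can be located in a protein sequence with a simple pattern search. The following is a minimal sketch, assuming that each "x" in GACxPxRRxxLC stands for any single amino acid; the function name and the toy sequence are hypothetical and are not taken from the paper.

```python
import re

# Assumption: every "x" in GACxPxRRxxLC matches exactly one residue of any type.
MOTIF = re.compile(r"GAC.P.RR..LC")

def find_motif(protein_seq):
    """Return (0-based start, matched text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein_seq.upper())]

# Made-up toy sequence for illustration, not a real PfEMP1 fragment.
toy = "MKTGACAPWRRSTLCQQ"
print(find_motif(toy))  # [(3, 'GACAPWRRSTLC')]
```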
PMCID: PMC3917906  PMID: 24516657
2.  Systems and Evolutionary Characterization of MicroRNAs and Their Underlying Regulatory Networks in Soybean Cotyledons 
PLoS ONE  2014;9(1):e86153.
MicroRNAs (miRNAs) are an emerging class of small RNAs regulating a wide range of biological processes. Soybean cotyledons evolved as sink tissues to synthesize and store seed reserves which directly affect soybean seed yield and quality. However, little is known about miRNAs and their regulatory networks in soybean cotyledons. We sequenced 292 million small RNA reads expressed in soybean cotyledons, and discovered 130 novel miRNA genes and 72 novel miRNA families. The cotyledon miRNAs arose at various stages of land plant evolution. Evolutionary analysis of the miRNA genes in duplicated genome segments from the recent Glycine whole genome duplication revealed that the majority of novel soybean cotyledon miRNAs were young, and likely arose after the duplication event 13 million years ago. We revealed the evolutionary pathway of a soybean cotyledon miRNA family (soy-miR15/49) that evolved from a neutral invertase gene through an inverted duplication and a series of DNA amplification and deletion events. A total of 304 miRNA genes were expressed in soybean cotyledons. The miRNAs were predicted to target 1910 genes, and form complex miRNA networks regulating a wide range of biological pathways in cotyledons. The comprehensive characterization of the miRNAs and their underlying regulatory networks at gene, pathway and system levels provides a foundation for further studies of miRNAs in cotyledons.
PMCID: PMC3903507  PMID: 24475082
3.  On the Value of Intra-Motif Dependencies of Human Insulator Protein CTCF 
PLoS ONE  2014;9(1):e85629.
The binding affinity of DNA-binding proteins such as transcription factors is mainly determined by the base composition of the corresponding binding site on the DNA strand. Most proteins do not bind only a single sequence, but rather a set of sequences, which may be modeled by a sequence motif. Algorithms for de novo motif discovery differ in their promoter models, learning approaches, and other aspects, but typically use the statistically simple position weight matrix model for the motif, which assumes statistical independence among all nucleotides. However, there is no clear justification for that assumption, leading to an ongoing debate about the importance of modeling dependencies between nucleotides within binding sites. In the past, modeling statistical dependencies within binding sites has been hampered by the problem of limited data. With the rise of high-throughput technologies such as ChIP-seq, this situation has now changed, making it possible to make use of statistical dependencies effectively. In this work, we investigate the presence of statistical dependencies in binding sites of the human enhancer-blocking insulator protein CTCF by using the recently developed model class of inhomogeneous parsimonious Markov models, which is capable of modeling complex dependencies while avoiding overfitting. These findings lead to a more detailed characterization of the CTCF binding motif, which is only poorly represented by independent nucleotide frequencies at several positions, predominantly at the 3′ end.
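The position weight matrix baseline that the abstract contrasts with dependency-aware models can be sketched in a few lines. This is only an illustration of the independence assumption, not the inhomogeneous parsimonious Markov model used in the study; the toy sites, the pseudocount, and the uniform background are assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Score a candidate site; summing per-position terms IS the independence assumption."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))

# Toy aligned sites (hypothetical, not CTCF ChIP-seq data).
sites = ["CCCTC", "CCCTC", "CCATC", "CCCTA"]
pwm = build_pwm(sites)
print(log_odds_score(pwm, "CCCTC") > log_odds_score(pwm, "GGGAG"))  # True
```

Because each position contributes its own term, a model of this form cannot express that the base at one position changes the preferred base at another, which is the limitation the abstract discusses.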
PMCID: PMC3899044  PMID: 24465627
4.  SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications 
PLoS ONE  2013;8(12):e82138.
The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Although various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignments, or are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.
To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost in efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm practical for genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool at:
The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.
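For readers unfamiliar with the underlying recurrence, the following is a minimal scalar sketch of Smith-Waterman local alignment with traceback. It is not the SSW library's API, it uses no SIMD striping, and it stores the full quadratic score matrix rather than working in linear space; the scoring values and the linear gap penalty are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-2, gap=-3):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; cells are clamped at 0 so alignments can restart.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTTT", "GGACGTGG"))  # (8, 'ACGT', 'ACGT')
```

Returning the aligned strings as well as the score mirrors the point made in the abstract: a score alone is rarely enough for read mapping or variant detection.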
PMCID: PMC3852983  PMID: 24324759
5.  Transcriptional Activity, Chromosomal Distribution and Expression Effects of Transposable Elements in Coffea Genomes 
PLoS ONE  2013;8(11):e78931.
Plant genomes are massively invaded by transposable elements (TEs), many of which are located near host genes and can thus impact gene expression. In flowering plants, TE expression can be activated (de-repressed) under certain stressful conditions, both biotic and abiotic, as well as by genome stress caused by hybridization. In this study, we examined the effects of these stress agents on TE expression in two diploid species of coffee, Coffea canephora and C. eugenioides, and their allotetraploid hybrid C. arabica. We also explored the relationship of TE repression mechanisms to host gene regulation via the effects of exonized TE sequences. Similar to what has been seen for other plants, overall TE expression levels are low in Coffea plant cultivars, consistent with the existence of effective TE repression mechanisms. TE expression patterns are highly dynamic across the species and conditions assayed here and are unrelated to their classification at the level of TE class or family. In contrast to previous results, cell culture conditions per se do not lead to the de-repression of TE expression in C. arabica. Results obtained here indicate that differing plant drought stress levels relate strongly to TE repression mechanisms. TEs tend to be expressed at significantly higher levels in non-irrigated samples for the drought-tolerant cultivars, whereas drought-sensitive cultivars showed the opposite pattern, with irrigated samples showing significantly higher TE expression. Thus, TE genome repression mechanisms may be finely tuned to the ideal growth and/or regulatory conditions of the specific plant cultivars in which they are active. Analysis of TE expression levels in cell culture conditions underscored the importance of nonsense-mediated mRNA decay (NMD) pathways in the repression of Coffea TEs. These same NMD mechanisms can also regulate plant host gene expression via the repression of genes that bear exonized TE sequences.
PMCID: PMC3823963  PMID: 24244387
6.  DNA Methylation Profiling of the Fibrinogen Gene Landscape in Human Cells and during Mouse and Zebrafish Development 
PLoS ONE  2013;8(8):e73089.
The fibrinogen genes FGA, FGB and FGG show coordinated expression in hepatocytes. Understanding the underlying transcriptional regulation may elucidate how their tissue-specific expression is maintained and explain the high variability in fibrinogen blood levels. DNA methylation of CpG-poor gene promoters is dynamic with low methylation correlating with tissue-specific gene expression but its direct effect on gene regulation as well as implications of non-promoter CpG methylation are not clear. Here we compared methylation of CpG sites throughout the fibrinogen gene cluster in human cells and mouse and zebrafish tissues. We observed low DNA methylation of the CpG-poor fibrinogen promoters and of additional regulatory elements (the liver enhancers CNC12 and PFE2) in fibrinogen-expressing samples. In a gene reporter assay, CpG-methylation in the FGA promoter reduced promoter activity, suggesting a repressive function for DNA methylation in the fibrinogen locus. In mouse and zebrafish livers we measured reductions in DNA methylation around fibrinogen genes during development that were preceded by increased fibrinogen expression and tri-methylation of Histone3 lysine4 (H3K4me3) in fibrinogen promoters. Our data support a model where changes in hepatic transcription factor expression and histone modification provide the switch for increased fibrinogen gene expression in the developing liver which is followed by reduction of CpG methylation.
PMCID: PMC3749180  PMID: 23991173
7.  High-Throughput Sequencing and Degradome Analysis Identify miRNAs and Their Targets Involved in Fruit Senescence of Fragaria ananassa 
PLoS ONE  2013;8(8):e70959.
In non-climacteric fruits, the respiratory increase is absent and no phytohormone appears to be critical for the ripening process. They must remain on the parent plant to ripen fully and be picked at or near the fully ripe stage to obtain the best eating quality. However, huge losses often occur because of their rapid post-harvest senescence. To understand the complex mechanism of post-harvest senescence in non-climacteric fruits, we constructed two small RNA libraries and one degradome from strawberry fruit stored at 20°C for 0 and 24 h. A total of 88 known and 1224 new candidate miRNAs, and 103 targets cleaved by 19 known miRNA families and 55 new candidate miRNAs, were obtained. These targets were associated with development, metabolism, defense response, signal transduction and transcriptional regulation. Among them, 14 targets, including NAC transcription factor, auxin response factors (ARF) and Myb transcription factors, cleaved by 6 known miRNA families and 6 predicted candidates, were found to be involved in regulating fruit senescence. The present study provides valuable information for understanding the rapid senescence of strawberry fruit, and offers a foundation for studying the miRNA-mediated senescence of non-climacteric fruits.
PMCID: PMC3747199  PMID: 23990918
8.  HMGN1 Modulates Nucleosome Occupancy and DNase I Hypersensitivity at the CpG Island Promoters of Embryonic Stem Cells 
Molecular and Cellular Biology  2013;33(16):3377-3389.
Chromatin structure plays a key role in regulating gene expression and embryonic differentiation; however, the factors that determine the organization of chromatin around regulatory sites are not fully known. Here we show that HMGN1, a nucleosome-binding protein ubiquitously expressed in vertebrate cells, preferentially binds to CpG island-containing promoters and affects the organization of nucleosomes, DNase I hypersensitivity, and the transcriptional profile of mouse embryonic stem cells and neural progenitors. Loss of HMGN1 alters the organization of an unstable nucleosome at transcription start sites, reduces the number of DNase I-hypersensitive sites genome wide, and decreases the number of nestin-positive neural progenitors in the subventricular zone (SVZ) region of mouse brain. Thus, architectural chromatin-binding proteins affect the transcription profile and chromatin structure during embryonic stem cell differentiation.
PMCID: PMC3753902  PMID: 23775126
9.  Chromatin Accessibility Data Sets Show Bias Due to Sequence Specificity of the DNase I Enzyme 
PLoS ONE  2013;8(7):e69853.
DNase I is an enzyme which cuts duplex DNA at a rate that depends strongly upon its chromatin environment. In combination with high-throughput sequencing (HTS) technology, it can be used to infer genome-wide landscapes of open chromatin regions. Using this technology, systematic identification of hundreds of thousands of DNase I hypersensitive sites (DHS) per cell type has been possible, and this in turn has helped to precisely delineate genomic regulatory compartments. However, to date there has been relatively little investigation into possible biases affecting this data.
We report a significant degree of sequence preference spanning sites cut by DNase I in a number of published data sets. The two major protocols in current use each show a different pattern, but for a given protocol the pattern of sequence specificity seems to be quite consistent. The patterns are substantially different from biases seen in other types of HTS data sets, and in some cases the most constrained position lies outside the sequenced fragment, implying that this constraint must relate to the digestion process rather than events occurring during library preparation or sequencing.
DNase I is a sequence-specific enzyme, with a specificity that may depend on experimental conditions. This sequence specificity is not taken into account by existing pipelines for identifying open chromatin regions. Care must be taken when interpreting DNase I results, especially when looking at the precise locations of the reads. Future studies may be able to improve the sensitivity and precision of chromatin state measurement by compensating for sequence bias.
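One simple way to look for the kind of sequence preference described above is to tabulate base frequencies at fixed offsets around each cut position. The sketch below assumes a genome held as a plain string and a list of 0-based cut coordinates; both inputs, the function name, and the window size are hypothetical, and this is not the analysis pipeline used in the paper.

```python
from collections import Counter

def cut_site_profile(genome, cut_positions, flank=3):
    """Base frequencies at each offset in [-flank, flank) around cut positions.

    Negative offsets lie upstream of the cut, i.e. on the side that would fall
    outside a read starting at the cut (an assumption about read orientation).
    """
    offsets = range(-flank, flank)
    counts = {off: Counter() for off in offsets}
    for pos in cut_positions:
        if pos - flank < 0 or pos + flank > len(genome):
            continue  # skip windows that run off the end of the sequence
        for off in offsets:
            counts[off][genome[pos + off]] += 1
    profile = {}
    for off, c in counts.items():
        total = sum(c.values())
        profile[off] = {b: c[b] / total for b in "ACGT"} if total else {}
    return profile

# Toy data: the cut at position 0 is skipped because its window is incomplete.
profile = cut_site_profile("ACGTACGTACGT", [0, 4, 8], flank=2)
print(profile[-2]["G"], profile[0]["A"])  # 1.0 1.0
```

An unbiased enzyme would give frequencies close to the genome-wide base composition at every offset; strong departures at particular offsets are the signature of sequence specificity.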
PMCID: PMC3724795  PMID: 23922824
10.  Unique and Conserved MicroRNAs in Wheat Chromosome 5D Revealed by Next-Generation Sequencing 
PLoS ONE  2013;8(7):e69801.
MicroRNAs are a class of short, non-coding, single-stranded RNAs that act as post-transcriptional regulators in gene expression. miRNA analysis of Triticum aestivum chromosome 5D was performed on 454 GS FLX Titanium sequences of flow-sorted chromosome 5D with a total of 3,208,630 good quality reads representing 1.34x and 1.61x coverage of the short (5DS) and long (5DL) arms of the chromosome, respectively. In silico and structural analyses revealed a total of 55 miRNAs; 48 and 42 miRNAs were found to be present on 5DL and 5DS, respectively, of which 35 were common to both chromosome arms, while 13 miRNAs were specific to 5DL and 7 miRNAs were specific to 5DS. In total, 14 of the predicted miRNAs were identified in wheat for the first time. Representation (the copy number of each miRNA) was also found to be higher in 5DL (1,949) than in 5DS (1,191). Targets were predicted for each miRNA, while expression analysis gave evidence of expression for 6 out of 55 miRNAs. Occurrences of the same miRNAs were also searched for in Brachypodium distachyon and Oryza sativa genome sequences to identify syntenic miRNA coding sequences. Based on this analysis, two other miRNAs, miR1133 and miR167, were detected in the B. distachyon region syntenic to wheat 5DS. Five of the predicted miRNA coding regions (miR6220, miR5070, miR169, miR5085, miR2118) were experimentally verified to be located on chromosome 5D, and three of them, miR2118, miR169 and miR5085, were shown to be 5D specific. Furthermore, miR2118 was shown to be expressed in Chinese Spring adult leaves. The miRNA genes identified in this study will expand our understanding of gene regulation in bread wheat.
PMCID: PMC3720673  PMID: 23936103
11.  Identification of Immunity Related Genes to Study the Physalis peruviana – Fusarium oxysporum Pathosystem 
PLoS ONE  2013;8(7):e68500.
The Cape gooseberry (Physalis peruviana L.) is an Andean exotic fruit with high nutritional value and appealing medicinal properties. However, its cultivation faces important phytosanitary problems mainly due to pathogens like Fusarium oxysporum, Cercospora physalidis and Alternaria spp. Here we used the Cape gooseberry foliar transcriptome to search for proteins that encode conserved domains related to plant immunity including: NBS (Nucleotide Binding Site), CC (Coiled-Coil), TIR (Toll/Interleukin-1 Receptor). We identified 74 immunity-related gene candidates in P. peruviana which have the typical resistance gene (R-gene) architecture: 17 receptor-like kinase (RLK) candidates related to PAMP-Triggered Immunity (PTI), and eight (TIR-NBS-LRR, or TNL) and nine (CC-NBS-LRR, or CNL) candidates related to Effector-Triggered Immunity (ETI) genes, among others. These candidate genes were categorized by molecular function (98%), biological process (85%) and cellular component (79%) using gene ontology. Some of the most interesting predicted roles were those associated with binding and transferase activity. We designed 94 primer pairs from the 74 immunity-related genes (IRGs) to amplify the corresponding genomic regions on six genotypes that included resistant and susceptible materials. From these, we selected 17 single-band amplicons and sequenced them in 14 F. oxysporum resistant and susceptible genotypes. Sequence polymorphisms were analyzed through preliminary candidate gene association, which allowed the detection of one SNP at the PpIRG-63 marker revealing a nonsynonymous mutation in the predicted LRR domain, suggesting functional roles for resistance.
PMCID: PMC3701084  PMID: 23844210
12.  Genome Sequences for Six Rhodanobacter Strains, Isolated from Soils and the Terrestrial Subsurface, with Variable Denitrification Capabilities 
Journal of Bacteriology  2012;194(16):4461-4462.
We report the first genome sequences for six strains of Rhodanobacter species isolated from a variety of soil and subsurface environments. Three of these strains are capable of complete denitrification and three others are not. However, all six strains contain most of the genes required for the respiration of nitrate to gaseous nitrogen. The nondenitrifying members of the genus lack only the gene for nitrate reduction, the first step in the full denitrification pathway. The data suggest that the environmental role of bacteria from the genus Rhodanobacter should be reevaluated.
PMCID: PMC3416251  PMID: 22843592
13.  HeurAA: Accurate and Fast Detection of Genetic Variations with a Novel Heuristic Amplicon Aligner Program for Next Generation Sequencing 
PLoS ONE  2013;8(1):e54294.
Next generation sequencing (NGS) of PCR amplicons is a standard approach to detect genetic variations in personalized medicine such as cancer diagnostics. Computer programs used in the NGS community often miss insertions and deletions (indels) that constitute a large part of known human mutations. We have developed HeurAA, an open source, heuristic amplicon aligner program. We tested the program on simulated datasets as well as experimental data from multiplex sequencing of 40 amplicons in 12 oncogenes collected on a 454 Genome Sequencer from lung cancer cell lines. We found that HeurAA can accurately detect all indels, and is more than an order of magnitude faster than previous programs. HeurAA can compare reads and reference sequences up to several thousand base pairs in length, and it can evaluate data from complex mixtures containing reads of different gene segments from different samples. HeurAA is written in C and Perl for Linux operating systems; the code and the documentation are available for research applications at
PMCID: PMC3548894  PMID: 23349847
14.  Next Generation Sequencing Based Transcriptome Analysis of Septic-Injury Responsive Genes in the Beetle Tribolium castaneum 
PLoS ONE  2013;8(1):e52004.
Beetles (Coleoptera) are the most diverse animal group on earth and interact with numerous symbiotic or pathogenic microbes in their environments. The red flour beetle Tribolium castaneum is a genetically tractable model beetle species and its whole genome sequence has recently been determined. To advance our understanding of the molecular basis of beetle immunity, we analyzed the whole transcriptome of T. castaneum by high-throughput next generation sequencing technology. We demonstrate that the Illumina/Solexa sequencing approach of cDNA samples from T. castaneum, including over 9.7 million reads of 72 base pairs (bp) in length (approximately 700 million bp of sequence information with about 30× transcriptome coverage), confirms the expression of most predicted genes and enables subsequent qualitative and quantitative transcriptome analysis. This approach recapitulates our recent quantitative real-time PCR studies of immune-challenged and naïve T. castaneum beetles, validating our approach. Furthermore, this sequencing analysis resulted in the identification of 73 genes that were differentially expressed upon immune challenge with statistical significance, determined by comparing expression data to calculated values derived by fitting to generalized linear models. We identified upregulation of diverse immune-related genes (e.g. Toll receptor, serine proteinases, DOPA decarboxylase and thaumatin) and of numerous genes encoding proteins with as yet unknown functions. Of note, septic injury also resulted in the elevated expression of genes encoding heat-shock proteins or cytochrome P450s, supporting the view that there is crosstalk between immune and stress responses in T. castaneum. The present study provides a first comprehensive overview of septic-injury responsive genes in T. castaneum beetles. The identified genes advance our understanding of T. castaneum-specific gene expression alteration upon immune challenge in particular and may help to understand beetle immunity in general.
PMCID: PMC3541394  PMID: 23326321
15.  Development of Transcriptomic Resources for Interrogating the Biosynthesis of Monoterpene Indole Alkaloids in Medicinal Plant Species 
PLoS ONE  2012;7(12):e52506.
The natural diversity of plant metabolism has long been a source for human medicines. One group of plant-derived compounds, the monoterpene indole alkaloids (MIAs), includes well-documented therapeutic agents used in the treatment of cancer (vinblastine, vincristine, camptothecin), hypertension (reserpine, ajmalicine), malaria (quinine), and as analgesics (7-hydroxymitragynine). Our understanding of the biochemical pathways that synthesize these commercially relevant compounds is incomplete due in part to a lack of molecular, genetic, and genomic resources for the identification of the genes involved in these specialized metabolic pathways. To address these limitations, we generated large-scale transcriptome sequence and expression profiles for three species of Asterids that produce medicinally important MIAs: Camptotheca acuminata, Catharanthus roseus, and Rauvolfia serpentina. Using next generation sequencing technology, we sampled the transcriptomes of these species across a diverse set of developmental tissues, and in the case of C. roseus, in cultured cells and roots following elicitor treatment. Through an iterative assembly process, we generated robust transcriptome assemblies for all three species with a substantial number of the assembled transcripts being full or near-full length. The majority of transcripts had a related sequence in either UniRef100, the Arabidopsis thaliana predicted proteome, or the Pfam protein domain database; however, we also identified transcripts that lacked similarity with entries in either database and thereby lack a known function. Representation of known genes within the MIA biosynthetic pathway was robust. As a diverse set of tissues and treatments were surveyed, expression abundances of transcripts in the three species could be estimated to reveal transcripts associated with development and response to elicitor treatment. 
Together, these transcriptomes and expression abundance matrices provide a rich resource for understanding plant specialized metabolism, and promote realization of innovative production systems for plant-derived pharmaceuticals.
PMCID: PMC3530497  PMID: 23300689
16.  Identification and Comparative Analysis of ncRNAs in Human, Mouse and Zebrafish Indicate a Conserved Role in Regulation of Genes Expressed in Brain 
PLoS ONE  2012;7(12):e52275.
ncRNAs (non-coding RNAs), in particular long ncRNAs, represent a significant proportion of the vertebrate transcriptome and probably regulate many biological processes. We used publicly available ESTs (Expressed Sequence Tags) from human, mouse and zebrafish and a previously published analysis pipeline to annotate and analyze the vertebrate non-protein-coding transcriptome. Comparative analysis confirmed some previously described features of intergenic ncRNAs, such as a positionally biased distribution with respect to regulatory or development-related protein-coding genes, and weak but clear sequence conservation across species. Significantly, comparative analysis of developmental and regulatory genes proximate to long ncRNAs indicated that the only conserved relationship of these genes to neighboring long ncRNAs was with respect to genes expressed in human brain, suggesting a conserved ncRNA cis-regulatory network in vertebrate nervous system development. Most of the relationships between long ncRNAs and proximate coding genes were not conserved, providing evidence for the rapid evolution of species-specific gene-associated long ncRNAs. We have reconstructed and annotated over 130,000 long ncRNAs in these three species, providing a significantly expanded number of candidates for functional testing by the research community.
PMCID: PMC3527520  PMID: 23284966
17.  A Global Genome Segmentation Method for Exploration of Epigenetic Patterns 
PLoS ONE  2012;7(10):e46811.
Current genome-wide ChIP-seq experiments on different epigenetic marks aim at unraveling the interplay between their regulation mechanisms. Published evaluation tools, however, allow testing for predefined hypotheses only. Here, we present a novel method for annotation-independent exploration of epigenetic data and their inter-correlation with other genome-wide features. Our method is based on a combinatorial genome segmentation solely using information on combinations of epigenetic marks. It does not require prior knowledge about the data (e.g. gene positions), but allows integrating the data in a straightforward manner. Thereby, it combines compression, clustering and visualization of the data in a single tool. Our method provides intuitive maps of epigenetic patterns across multiple levels of organization, e.g. of the co-occurrence of different epigenetic marks in different cell types. Thus, it facilitates the formulation of new hypotheses on the principles of epigenetic regulation. We apply our method to histone modification data on trimethylation of histone H3 at lysine 4, 9 and 27 in multi-potent and lineage-primed mouse cells, analyzing their combinatorial modification pattern as well as differentiation-related changes of single modifications. We demonstrate that our method is capable of reproducing recent findings of gene centered approaches, e.g. correlations between CpG-density and the analyzed histone modifications. Moreover, combining the clustered epigenetic data with information on the expression status of associated genes we classify differences in epigenetic status of e.g. house-keeping genes versus differentiation-related genes. Visualizing the distribution of modification states on the chromosomes, we discover strong patterns for chromosome X. For example, exclusively H3K9me3 marked segments are enriched, while poised and active states are rare. 
Hence, our method also provides new insights into chromosome-specific epigenetic patterns, raising new questions about how “epigenetic computation” is distributed over the genome in space and time.
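The idea of segmenting a genome purely by the combination of marks present can be illustrated with a small sketch. The abstract does not give the actual procedure, so the version below is one simple reading under stated assumptions: signals are pre-binned, a mark counts as present when its value reaches a fixed threshold, and adjacent bins with the same combination are merged. The function name, threshold, and toy values are hypothetical.

```python
from itertools import groupby

def segment(bin_signals, threshold=1.0):
    """Merge adjacent bins that carry the same combination of marks.

    bin_signals: dict mapping mark name -> list of per-bin values (equal lengths).
    Returns a list of (start_bin, end_bin_exclusive, frozenset_of_marks).
    """
    marks = sorted(bin_signals)
    n_bins = len(bin_signals[marks[0]])
    # Combinatorial state of each bin: the set of marks at or above threshold.
    states = [
        frozenset(m for m in marks if bin_signals[m][i] >= threshold)
        for i in range(n_bins)
    ]
    segments, start = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((start, start + length, state))
        start += length
    return segments

# Toy per-bin signals (made-up values, not real ChIP-seq data).
signals = {
    "H3K4me3":  [2, 2, 2, 0, 0, 0],
    "H3K27me3": [0, 0, 2, 2, 0, 0],
}
for seg in segment(signals):
    print(seg)
```

In this toy input, bin 2 carries both marks and forms its own segment, which corresponds to the kind of combined state the abstract calls poised when H3K4me3 and H3K27me3 co-occur.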
PMCID: PMC3470578  PMID: 23077526
18.  In silico identification and characterization of the ion transport specificity for P-type ATPases in the Mycobacterium tuberculosis complex 
P-type ATPases hydrolyze ATP and release energy that is used in the transport of ions against electrochemical gradients across plasma membranes, making these proteins essential for cell viability. Currently, the distribution and function of these ion transporters in mycobacteria are poorly understood.
In this study, probabilistic profiles were constructed based on hidden Markov models to identify and classify P-type ATPases in the Mycobacterium tuberculosis complex (MTBC) according to the type of ion transported across the plasma membrane. Topology, hydrophobicity profiles and conserved motifs were analyzed to correlate amino acid sequences of P-type ATPases and ion transport specificity. Twelve candidate P-type ATPases annotated in the M. tuberculosis H37Rv proteome were identified in all members of the MTBC, and probabilistic profiles classified them into one of the following three groups: heavy metal cation transporters, alkaline and alkaline earth metal cation transporters, and the beta subunit of a prokaryotic potassium pump. Interestingly, counterparts of the non-catalytic beta subunits of Hydrogen/Potassium and Sodium/Potassium P-type ATPases were not found.
The high content of heavy metal transporters found in the MTBC suggests that they could play an important role in the ability of M. tuberculosis to survive inside macrophages, where tubercle bacilli face high levels of toxic metals. Finally, the results obtained in this work provide a starting point for experimental studies that may elucidate the ion specificity of the MTBC P-type ATPases and their role in mycobacterial infections.
PMCID: PMC3573892  PMID: 23031689
Tuberculosis; Mycobacterium tuberculosis complex; P-type ATPases; Ion transport; Conserved motifs
19.  Differences in local genomic context of bound and unbound motifs 
Gene  2012;506(1):125-134.
Understanding gene regulation is a major objective in molecular biology research. Frequently, transcription is driven by transcription factors (TFs) that bind to specific DNA sequences. These motifs are usually short and degenerate, making it likely that multiple copies occur throughout the genome by random chance. Despite this, TFs only bind to a small subset of sites, thus prompting our investigation into the differences between motifs that are bound by TFs and those that remain unbound. Here we constructed vectors representing various chromatin- and sequence-based features for a published set of bound and unbound motifs representing nine TFs in the budding yeast Saccharomyces cerevisiae. Using a machine learning approach, we identified a set of features that can be used to discriminate between bound and unbound motifs. We also discovered that some TFs bind most or all of their strong motifs in intergenic regions. Our data demonstrate that local sequence context can be strikingly different around motifs that are bound compared to motifs that are unbound. We concluded that there are multiple combinations of genomic features that characterize bound or unbound motifs.
PMCID: PMC3412921  PMID: 22692006
Gene regulation; yeast; transcription factors; genomic features; machine learning
20.  Bovine ncRNAs Are Abundant, Primarily Intergenic, Conserved and Associated with Regulatory Genes 
PLoS ONE  2012;7(8):e42638.
It is apparent that non-coding transcripts are a common feature of higher organisms and encode uncharacterized layers of genetic regulation and information. We used public bovine EST data from many developmental stages and tissues, and developed a pipeline for the genome-wide identification and annotation of non-coding RNAs (ncRNAs). We have predicted 23,060 bovine ncRNAs, 99% of which are unannotated in known ncRNA databases. Intergenic transcripts accounted for the majority (57%) of the predicted ncRNAs, and the occurrences of ncRNAs and genes were only moderately correlated (r = 0.55, p-value<2.2e-16). Many of these intergenic non-coding RNAs mapped close to the 3′ or 5′ end of thousands of genes and many of these were transcribed from the opposite strand with respect to the closest gene, particularly regulatory-related genes. Conservation analyses showed that these ncRNAs were evolutionarily conserved, and many intergenic ncRNAs proximate to genes contained sequence-specific motifs. Correlation analysis of expression between these intergenic ncRNAs and protein-coding genes using RNA-seq data from a variety of tissues showed significant correlations with many transcripts. These results support the hypothesis that ncRNAs are common, transcribed in a regulated fashion and have regulatory functions.
PMCID: PMC3412814  PMID: 22880061
21.  ChIPnorm: A Statistical Method for Normalizing and Identifying Differential Regions in Histone Modification ChIP-seq Libraries 
PLoS ONE  2012;7(8):e39573.
The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data, and to find differential regions in the genome, given two libraries of histone modifications of different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that histone modifications H3K27me3 and H3K4me3 act, respectively, as a repressor and an activator of genes. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from
PMCID: PMC3411705  PMID: 22870189
22.  The Practical Evaluation of DNA Barcode Efficacy* 
This chapter describes a workflow for measuring the efficacy of a barcode in identifying species. First, assemble individual sequence databases corresponding to each barcode marker. A controlled collection of taxonomic data is preferable to GenBank data, because GenBank data can be problematic, particularly when comparing barcodes based on more than one marker. To ensure proper controls when evaluating species identification, specimens not having a sequence in every marker database should be discarded. Second, select a computer algorithm for assigning species to barcode sequences. No algorithm has yet improved notably on assigning a specimen to the species of its nearest neighbor within a barcode database. Because global sequence alignments (e.g., with the Needleman–Wunsch algorithm, or some related algorithm) examine entire barcode sequences, they generally produce better species assignments than local sequence alignments (e.g., with BLAST). No neighboring method (e.g., global sequence similarity, global sequence distance, or evolutionary distance based on a global alignment) has yet shown a notable superiority in identifying species. Finally, “the probability of correct identification” (PCI) provides an appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. This chapter states explicitly how to calculate PCI, how to estimate its statistical sampling error, and how to use data on PCR failure to set limits on how much improvements in PCR technology can improve species identification.
PMCID: PMC3410705  PMID: 22684965
Barcode efficacy in species identification; Probability of correct identification; DNA barcode
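The nearest-neighbor assignment and the PCI averaging described in the abstract above can be illustrated with a minimal sketch. Several parts are assumptions made for illustration and are not taken from the chapter: the sequences are treated as already globally aligned to equal length, a plain mismatch count stands in for the global alignment distance, a specimen counts as correctly identified when its nearest other specimen in the database belongs to the same species, and ties go to the first specimen found. The function names are hypothetical. Only the final step, averaging the species PCIs over all species, is stated in the abstract.

```python
from collections import defaultdict


def mismatch_distance(a, b):
    """Count mismatched positions between two pre-aligned, equal-length sequences.

    Assumption: stands in for a distance derived from a global alignment.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_species(index, specimens):
    """Return the species of the closest other specimen (ties: first found)."""
    _, query_seq = specimens[index]
    best_species, best_dist = None, None
    for j, (species, seq) in enumerate(specimens):
        if j == index:
            continue  # a specimen may not be its own neighbor
        d = mismatch_distance(query_seq, seq)
        if best_dist is None or d < best_dist:
            best_species, best_dist = species, d
    return best_species


def overall_pci(specimens):
    """Return (overall PCI, per-species PCI) for a list of (species, sequence) pairs.

    Assumption: species PCI is the fraction of that species' specimens whose
    nearest neighbor is conspecific. The overall PCI is the unweighted average
    of the species PCIs, taken over all species.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, (species, _) in enumerate(specimens):
        total[species] += 1
        if nearest_neighbor_species(i, specimens) == species:
            correct[species] += 1
    species_pci = {s: correct[s] / total[s] for s in total}
    return sum(species_pci.values()) / len(species_pci), species_pci
```

Under these assumptions, a species represented by a single specimen can never be identified correctly, because its only possible neighbors belong to other species. Averaging per species rather than per specimen also keeps heavily sampled species from dominating the overall figure.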
23.  Bivalent-Like Chromatin Markers Are Predictive for Transcription Start Site Distribution in Human 
PLoS ONE  2012;7(6):e38112.
Deep sequencing of 5′ capped transcripts has revealed a variety of transcription initiation patterns, from narrow, focused promoters to wide, broad promoters. Attempts have already been made to model empirically classified patterns, but virtually no quantitative models for transcription initiation have been reported. Even though both genetic and epigenetic elements have been associated with such patterns, the organization of regulatory elements is largely unknown. Here, linear regression models were derived from a pool of regulatory elements, including genomic DNA features, nucleosome organization, and histone modifications, to predict the distribution of transcription start sites (TSS). Importantly, models including both active and repressive histone modification markers, e.g. H3K4me3 and H4K20me1, were consistently found to be much more predictive than models with only single-type histone modification markers, indicating the possibility of “bivalent-like” epigenetic control of transcription initiation. The nucleosome positions are proposed to be coded in the active component of such bivalent-like histone modification markers. Finally, we demonstrated that models trained on one cell type could successfully predict TSS distribution in other cell types, suggesting that these models may have a broader application range.
PMCID: PMC3387189  PMID: 22768038
24.  The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction 
BMC Genomics  2012;13:151.
Physalis peruviana, commonly known as Cape gooseberry, is a member of the Solanaceae family that is increasingly popular due to its nutritional and medicinal value. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry.
We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembly was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs.
We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including functional diversity, conservation, and improvement to increase productivity and fruit quality. P. peruviana was estimated to have branched off phylogenetically before the divergence of five other Solanaceae family members: S. lycopersicum, S. tuberosum, Capsicum spp., S. melongena and Petunia spp.
PMCID: PMC3488962  PMID: 22533342
P. peruviana; Solanaceae; ESTs; Functional annotation; Gene model; Phylogenetics
25.  Spatial Proximity and Similarity of the Epigenetic State of Genome Domains 
PLoS ONE  2012;7(4):e33947.
Recent studies demonstrate that the organization of the chromatin within the nuclear space might play a crucial role in the regulation of gene expression. The ongoing progress in determination of the 3D structure of the nuclear chromatin allows one to study correlations between spatial proximity of genome domains and their epigenetic state. We combined the data on three-dimensional architecture of the whole human genome with results of high-throughput studies of the chromatin functional state and observed that fragments of different chromosomes that are spatially close tend to have similar patterns of histone modifications, methylation state, DNase sensitivity, expression level, and chromatin states in general. Moreover, clustering of genome regions by spatial proximity produced compact clusters characterized by a high level of histone modifications and DNase sensitivity and a low methylation level, and loose clusters with the opposite characteristics. We also associated the spatial proximity data with previously detected chimeric transcripts and the results of RNA-seq experiments and observed that the frequency of formation of chimeric transcripts from fragments of two different chromosomes is higher among spatially proximal genome domains. A fair fraction of these chimeric transcripts seems to arise post-transcriptionally via trans-splicing.
PMCID: PMC3319547  PMID: 22496774

