1.  INDIGO – INtegrated Data Warehouse of MIcrobial GenOmes with Examples from the Red Sea Extremophiles 
PLoS ONE  2013;8(12):e82210.
The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.
We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources starting from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is intended to be populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea, extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects of the unique lifestyle and adaptations of these organisms to extreme environments.
We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is aimed to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at
PMCID: PMC3855842  PMID: 24324765
2.  Simplified Method for Predicting a Functional Class of Proteins in Transcription Factor Complexes 
PLoS ONE  2013;8(7):e68857.
Initiation of transcription is essential for most of the cellular responses to environmental conditions and for cell and tissue specificity. This process is regulated through numerous proteins, their ligands and mutual interactions, as well as interactions with DNA. The key such regulatory proteins are transcription factors (TFs) and transcription co-factors (TcoFs). TcoFs are important since they modulate the transcription initiation process through interaction with TFs. In eukaryotes, transcription requires that TFs form different protein complexes with various nuclear proteins. To better understand transcription regulation, it is important to know the functional class of proteins interacting with TFs during transcription initiation. Such information is not fully available, since not all proteins that act as TFs or TcoFs are yet annotated as such, due to generally partial functional annotation of proteins. In this study we have developed a method to predict, using only sequence composition of the interacting proteins, the functional class of human TF binding partners to be (i) TF, (ii) TcoF, or (iii) other nuclear protein. This allows for complementing the annotation of the currently known pool of nuclear proteins. Since only the knowledge of protein sequences is required in addition to protein interaction, the method should be easily applicable to many species.
Based on experimentally validated interactions between human TFs with different TFs, TcoFs and other nuclear proteins, our two classification systems (implemented as a web-based application) achieve high accuracies in distinguishing TFs and TcoFs from other nuclear proteins, and TFs from TcoFs respectively.
As demonstrated, given the fact that two proteins are capable of forming direct physical interactions and using only information about their sequence composition, we have developed a completely new method for predicting a functional class of TF interacting protein partners with high precision and accuracy.
PMCID: PMC3709904  PMID: 23874789
3.  PIMiner: a web tool for extraction of protein interactions from biomedical literature 
Information on protein interactions (PIs) is valuable for biomedical research, but often lies buried in the scientific literature and cannot be readily retrieved. While much progress has been made over the years in extracting PIs from the literature using computational methods, there is a lack of free, public, user-friendly tools for the discovery of PIs. We developed PIMiner, an online tool for the extraction of PI relationships from PubMed-abstracts. Protein pairs and the words that describe their interactions are reported by PIMiner along with the interaction likelihood levels, so that new interactions can be readily detected within text. The option to extract only specific types of interactions is also provided. The PIMiner server can be accessed through a web browser or remotely through a client’s command line. PIMiner can process 50,000 PubMed abstracts in approximately seven minutes and is thus suitable for large scale processing of biological literature.
PMCID: PMC4303605  PMID: 23798227
Protein interactions; PIs; literature mining; biological textmining; systems biology; interactome mining; data mining; bioinformatics; complex networks
4.  Mutations and Binding Sites of Human Transcription Factors 
Frontiers in Genetics  2012;3:100.
Mutations in any genome may lead to phenotype characteristics that determine the ability of an individual to cope with adaptation to environmental challenges. In studies of human biology, among the most interesting ones are phenotype characteristics that determine responses to drug treatments, response to infections, or predisposition to specific inherited diseases. Most of the research in this field has been focused on the studies of mutation effects on the final gene products, peptides, and their alterations. Considerably less attention was given to the mutations that may affect regulatory mechanism(s) of gene expression, although these may also affect the phenotype characteristics. In this study we present a pilot analysis of mutations observed in the regulatory regions of 24,667 human RefSeq genes. Our study reveals that out of eight studied mutation types, “insertions” is the only type that alters predicted transcription factor binding sites (TFBSs) in a statistically significant manner. We also find that 25 families of TFBSs have been altered by mutations in a statistically significant manner in the promoter regions we considered. Moreover, we find that the related transcription factors are, for example, prominent in processes related to intracellular signaling; cell fate; morphogenesis of organs and epithelium; development of urogenital system, epithelium, and tube; neuron fate commitment. Our study highlights the significance of studying mutations within gene regulatory regions and opens the way for further detailed investigations on this topic, particularly on the downstream affected pathways.
PMCID: PMC3365286  PMID: 22670148
SNP; insertion; deletion; mutation; transcription factor; transcription factor binding site; promoter region; bioinformatics
5.  DEEP: a general computational framework for predicting enhancers 
Nucleic Acids Research  2014;43(1):e6.
Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located at gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identification of distal regulatory elements is a challenge for the bioinformatics research. Although existing methodologies increased the number of computationally predicted enhancers, performance inconsistency of computational models across different cell-lines, class imbalance within the learning sets and ad hoc rules for selecting enhancer candidates for supervised learning, are some key questions that require further examination. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancer's properties in a great variety of cellular conditions. In our method we train many individual classification models that we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes coming from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data where DEEP achieves 90.2% accuracy and 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results derived using in vivo-derived enhancer data from VISTA database. DEEP-VISTA, when tested on an independent test set, achieved GM of 80.1% and accuracy of 89.64%. DEEP framework is publicly available at
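The geometric mean (GM) of specificity and sensitivity quoted above can be illustrated with a minimal sketch. The function names and the confusion-matrix counts below are hypothetical and are not taken from the DEEP paper; the sketch only shows how GM and accuracy are computed, and why the two can differ when classes are imbalanced.

```python
import math


def geometric_mean_score(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true enhancers recovered
    specificity = tn / (tn + fp)  # fraction of non-enhancers correctly rejected
    return math.sqrt(sensitivity * specificity)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical, imbalanced example: 100 positives and 1000 negatives.
example = dict(tp=80, fn=20, tn=900, fp=100)
print(f"GM = {geometric_mean_score(**example):.3f}")
print(f"accuracy = {accuracy(**example):.3f}")
```

With these made-up counts, accuracy is dominated by the large negative class, whereas GM weights performance on both classes equally, which is presumably why both figures are reported.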
PMCID: PMC4288148  PMID: 25378307
6.  Chemical Compounds Toxic to Invertebrates Isolated from Marine Cyanobacteria of Potential Relevance to the Agricultural Industry 
Toxins  2014;6(11):3058-3076.
In spite of advances in invertebrate pest management, the agricultural industry is suffering from impeded pest control exacerbated by global climate changes that have altered rain patterns to favour opportunistic breeding. Thus, novel naturally derived chemical compounds toxic to both terrestrial and aquatic invertebrates are of interest, as potential pesticides. In this regard, marine cyanobacterium-derived metabolites that are toxic to both terrestrial and aquatic invertebrates continue to be a promising, but neglected, source of potential pesticides. A PubMed query combined with hand-curation of the information from retrieved articles allowed for the identification of 36 cyanobacteria-derived chemical compounds experimentally confirmed as being toxic to invertebrates. These compounds are discussed in this review.
PMCID: PMC4247248  PMID: 25356733
cyanobacteria; molluscicide; snail; slugs; worms; crustacean; brine shrimp; invertebrate; environmental; agriculture; climate change; toxic compounds
7.  DEOP: a database on osmoprotectants and associated pathways 
Microorganisms are known to counteract salt stress through salt influx or by the accumulation of osmoprotectants (also called compatible solutes). Understanding the pathways that synthesize and/or break down these osmoprotectants is of interest to studies of crop halotolerance and to biotechnology applications that use microbes as cell factories for production of biomass or commercial chemicals. To facilitate the exploration of osmoprotectants, we have developed the first online resource, ‘Dragon Explorer of Osmoprotection associated Pathways’ (DEOP), that gathers and presents curated information about osmoprotectants, complemented by information about reactions and pathways that use or affect them. A combined total of 141 compounds were confirmed osmoprotectants, which were matched to 1883 reactions and 834 pathways. DEOP can also be used to map genes or microbial genomes to potential osmoprotection-associated pathways, and thus link genes and genomes to other associated osmoprotection information. Moreover, DEOP provides a text-mining utility to search deeper into the scientific literature for supporting evidence or for new associations of osmoprotectants to pathways, reactions, enzymes, genes or organisms. Two case studies are provided to demonstrate the usefulness of DEOP. The system can be accessed at.
Database URL:
PMCID: PMC4201361  PMID: 25326239
8.  Promoter Analysis Reveals Globally Differential Regulation of Human Long Non-Coding RNA and Protein-Coding Genes 
PLoS ONE  2014;9(10):e109443.
Transcriptional regulation of protein-coding genes is increasingly well-understood on a global scale, yet no comparable information exists for long non-coding RNA (lncRNA) genes, which were recently recognized to be as numerous as protein-coding genes in mammalian genomes. We performed a genome-wide comparative analysis of the promoters of human lncRNA and protein-coding genes, finding global differences in specific genetic and epigenetic features relevant to transcriptional regulation. These two groups of genes are hence subject to separate transcriptional regulatory programs, including distinct transcription factor (TF) proteins that significantly favor lncRNA, rather than coding-gene, promoters. We report a specific signature of promoter-proximal transcriptional regulation of lncRNA genes, including several distinct transcription factor binding sites (TFBS). Experimental DNase I hypersensitive site profiles are consistent with active configurations of these lncRNA TFBS sets in diverse human cell types. TFBS ChIP-seq datasets confirm the binding events that we predicted using computational approaches for a subset of factors. For several TFs known to be directly regulated by lncRNAs, we find that their putative TFBSs are enriched at lncRNA promoters, suggesting that the TFs and the lncRNAs may participate in a bidirectional feedback loop regulatory network. Accordingly, cells may be able to modulate lncRNA expression levels independently of mRNA levels via distinct regulatory pathways. Our results also raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted in the future.
PMCID: PMC4183604  PMID: 25275320
9.  Aerobic methanotrophic communities at the Red Sea brine-seawater interface 
The central rift of the Red Sea contains 25 brine pools with different physicochemical conditions, dictating the diversity and abundance of the microbial community. Three of these pools, the Atlantis II, Kebrit and Discovery Deeps, are uniquely characterized by a high concentration of hydrocarbons. The brine-seawater interface, described as an anoxic-oxic (brine-seawater) boundary, is characterized by a high methane concentration, thus favoring aerobic methane oxidation. The current study analyzed the aerobic free-living methane-oxidizing bacterial communities that potentially contribute to methane oxidation at the brine-seawater interfaces of the three aforementioned brine pools, using metagenomic pyrosequencing, 16S rRNA pyrotags and pmoA library constructs. The sequencing of 16S rRNA pyrotags revealed that these interfaces are characterized by high microbial community diversity. Signatures of aerobic methane-oxidizing bacteria were detected in the Atlantis II Interface (ATII-I) and the Kebrit Deep Upper (KB-U) and Lower (KB-L) brine-seawater interfaces. Through phylogenetic analysis of pmoA, we further demonstrated that the ATII-I aerobic methanotroph community is highly diverse. We propose four ATII-I pmoA clusters. Most importantly, cluster 2 groups with marine methane seep methanotrophs, and cluster 4 represents a unique lineage of an uncultured bacterium with divergent alkane monooxygenases. Moreover, non-metric multidimensional scaling (NMDS) based on the ordination of putative enzymes involved in methane metabolism showed that the Kebrit interface layers were distinct from the ATII-I and DD-I brine-seawater interfaces.
PMCID: PMC4172156  PMID: 25295031
brine-seawater interfaces; aerobic methanotrophs; pmoA; 16S rRNA gene; Red Sea
10.  Genome Sequence of Pseudomonas sp. Strain Chol1, a Model Organism for the Degradation of Bile Salts and Other Steroid Compounds 
Genome Announcements  2013;1(1):e00014-12.
Bacterial degradation of steroid compounds is of high ecological and biotechnological relevance. Pseudomonas sp. strain Chol1 is a model organism for studying the degradation of the steroid compound cholate. Its draft genome sequence is presented and reveals one gene cluster responsible for the metabolism of steroid compounds.
PMCID: PMC3569358  PMID: 23405354
11.  Core Microbial Functional Activities in Ocean Environments Revealed by Global Metagenomic Profiling Analyses 
PLoS ONE  2014;9(6):e97338.
Metagenomics-based functional profiling analysis is an effective means of gaining deeper insight into the composition of marine microbial populations and developing a better understanding of the interplay between the functional genome content of microbial communities and abiotic factors. Here we present a comprehensive analysis of 24 datasets covering surface and depth-related environments at 11 sites around the world's oceans. The complete dataset comprises approximately 12 million sequences, totaling 5,358 Mb. Based on profiling patterns of Clusters of Orthologous Groups (COGs) of proteins, a core set of reference photic and aphotic depth-related COGs, and a collection of COGs that are associated with extreme oxygen limitation were defined. Their inferred functions were utilized as indicators to characterize the distribution of light- and oxygen-related biological activities in marine environments. The results reveal that, while light level in the water column is a major determinant of phenotypic adaptation in marine microorganisms, oxygen concentration in the aphotic zone has a significant impact only in extremely hypoxic waters. Phylogenetic profiling of the reference photic/aphotic gene sets revealed a greater variety of source organisms in the aphotic zone, although the majority of individual photic and aphotic depth-related COGs are assigned to the same taxa across the different sites. This increase in phylogenetic and functional diversity of the core aphotic related COGs most probably reflects selection for the utilization of a broad range of alternate energy sources in the absence of light.
PMCID: PMC4055538  PMID: 24921648
12.  Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA) 
Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO). Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxa of bacteria and archaea was selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present.
Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website.
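The reliability rule stated in the abstract (a hit is reliable if at least two relevant descriptors, GO-terms and/or consensus patterns, are present) can be sketched as follows. This is only an illustration of that rule, not the authors' scripts: the class, the function, and the descriptor identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneAnnotation:
    """Hypothetical record holding the descriptors annotated for one gene."""
    gene_id: str
    go_terms: frozenset = frozenset()
    prosite_ids: frozenset = frozenset()


def is_reliable(gene: GeneAnnotation,
                relevant_go: frozenset,
                relevant_prosite: frozenset,
                min_descriptors: int = 2) -> bool:
    """Return True if the gene carries at least `min_descriptors` relevant descriptors."""
    profile_hits = gene.go_terms & relevant_go            # profile filter
    pattern_hits = gene.prosite_ids & relevant_prosite    # pattern filter
    return len(profile_hits) + len(pattern_hits) >= min_descriptors


# Placeholder identifiers, not real GO-terms or PROSITE IDs.
relevant_go = frozenset({"GO_A", "GO_B"})
relevant_prosite = frozenset({"PS_A"})

genes = [
    GeneAnnotation("gene1", frozenset({"GO_A"}), frozenset({"PS_A"})),
    GeneAnnotation("gene2", frozenset({"GO_B"}), frozenset()),
]
reliable_ids = [g.gene_id for g in genes
                if is_reliable(g, relevant_go, relevant_prosite)]
print(reliable_ids)
```

In this toy input, only the gene supported by both a profile hit and a pattern hit passes the filter; a gene with a single descriptor is treated as a potential false positive.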
PMCID: PMC3985023  PMID: 24778629
bioinformatics; single amplified genomes; halophiles; extremophile; protein sequence consensus patterns; PROSITE IDs; GO-terms; functional genomics
13.  Symbiotic Adaptation Drives Genome Streamlining of the Cyanobacterial Sponge Symbiont “Candidatus Synechococcus spongiarum” 
mBio  2014;5(2):e00079-14.
“Candidatus Synechococcus spongiarum” is a cyanobacterial symbiont widely distributed in sponges, but its functions at the genome level remain unknown. Here, we obtained the draft genome (1.66 Mbp, 90% estimated genome recovery) of “Ca. Synechococcus spongiarum” strain SH4 inhabiting the Red Sea sponge Carteriospongia foliascens. Phylogenomic analysis revealed a high dissimilarity between SH4 and free-living cyanobacterial strains. Essential functions, such as photosynthesis, the citric acid cycle, and DNA replication, were detected in SH4. Eukaryotic-like domains that play important roles in sponge-symbiont interactions were identified exclusively in the symbiont. However, SH4 could not biosynthesize methionine and polyamines and had lost some of the genes encoding low-molecular-weight peptides of the photosynthesis complex, antioxidant enzymes, DNA repair enzymes, and proteins involved in resistance to environmental toxins and in biosynthesis of capsular and extracellular polysaccharides. These genetic modifications imply that “Ca. Synechococcus spongiarum” SH4 represents a low-light-adapted cyanobacterial symbiont and has undergone genome streamlining to adapt to the sponge’s mild intercellular environment.
Although the diversity of sponge-associated microbes has been widely studied, genome-level research on sponge symbionts and their symbiotic mechanisms is rare because they are unculturable. “Candidatus Synechococcus spongiarum” is a widely distributed uncultivated cyanobacterial sponge symbiont. The genome of this symbiont will help to characterize its evolutionary relationship and functional dissimilarity to closely related free-living cyanobacterial strains. Knowledge of its adaptive mechanism to the sponge host also depends on genome-level research. The data presented here provide an alternative strategy to obtain the draft genome of “Ca. Synechococcus spongiarum” strain SH4 and provide insight into its evolutionary and functional features.
PMCID: PMC3977351  PMID: 24692632
14.  Effects of cytosine methylation on transcription factor binding sites 
BMC Genomics  2014;15:119.
DNA methylation in promoters is closely linked to downstream gene repression. However, whether DNA methylation is a cause or a consequence of gene repression remains an open question. If it is a cause, then DNA methylation may affect the affinity of transcription factors (TFs) for their binding sites (TFBSs). If it is a consequence, then gene repression caused by chromatin modification may be stabilized by DNA methylation. Until now, these two possibilities have been supported only by non-systematic evidence and they have not been tested on a wide range of TFs. An average promoter methylation is usually used in studies, whereas recent results suggested that methylation of individual cytosines can also be important.
We found that the methylation profiles of 16.6% of cytosines and the expression profiles of neighboring transcriptional start sites (TSSs) were significantly negatively correlated. We called the CpGs corresponding to such cytosines “traffic lights”. We observed a strong selection against CpG “traffic lights” within TFBSs. The negative selection was stronger for transcriptional repressors as compared with transcriptional activators or multifunctional TFs as well as for core TFBS positions as compared with flanking TFBS positions.
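The “traffic light” idea, a cytosine whose methylation profile is negatively correlated with the expression profile of a neighboring TSS, can be sketched as below. The abstract does not state which correlation measure or significance test was used, so this sketch assumes a Pearson coefficient and a fixed cutoff as a stand-in for a proper significance test; the function names, the cutoff, and the sample values are all hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length profiles with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0.0:
        raise ValueError("a constant profile has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def is_candidate_traffic_light(methylation: Sequence[float],
                               expression: Sequence[float],
                               threshold: float = -0.5) -> bool:
    """Flag a cytosine whose methylation falls as nearby TSS expression rises.

    `threshold` is an assumed cutoff, not a value from the study.
    """
    return pearson_r(methylation, expression) < threshold


# Made-up profiles across four samples (e.g. cell types).
methylation = [0.9, 0.7, 0.4, 0.1]   # fraction methylated at one cytosine
expression = [1.0, 3.0, 6.0, 9.0]    # expression level of the neighboring TSS
print(is_candidate_traffic_light(methylation, expression))
```

A real analysis would replace the fixed cutoff with a p-value and a multiple-testing correction across all cytosines tested.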
Our results indicate that direct and selective methylation of certain TFBS that prevents TF binding is restricted to special cases and cannot be considered as a general regulatory mechanism of transcription.
PMCID: PMC3986887  PMID: 24669864
DNA methylation; Transcription factor binding sites; Transcriptional regulation; CAGE; RRBS; CpG “traffic lights”; Bioinformatics; Computational biology
15.  Induction of apoptosis in cancer cell lines by the Red Sea brine pool bacterial extracts 
Marine microorganisms are considered to be an important source of bioactive molecules against various diseases and have great potential to increase the number of lead molecules in clinical trials. Progress in novel microbial culturing techniques as well as greater accessibility to unique oceanic habitats has placed the marine environment as a new frontier in the field of natural product drug discovery.
A total of 24 microbial extracts from deep-sea brine pools in the Red Sea have been evaluated for their anticancer potential against three human cancer cell lines. Downstream analysis of the six most potent extracts was done using various biological assays, such as Caspase-3/7 activity, mitochondrial membrane potential (MMP), PARP-1 cleavage and expression of γH2Ax, Caspase-8 and -9 using western blotting.
In general, most of the microbial extracts were found to be cytotoxic against one or more cancer cell lines with cell line specific activities. Out of the 13 most active microbial extracts, six extracts were able to induce significantly higher apoptosis (>70%) in cancer cells. Mechanism level studies revealed that extracts from Chromohalobacter salexigens (P3-86A and P3-86B(2)) followed the sequence of events of apoptotic pathway involving MMP disruption, caspase-3/7 activity, caspase-8 cleavage, PARP-1 cleavage and Phosphatidylserine (PS) exposure, whereas another Chromohalobacter salexigens extract (K30) induced caspase-9 mediated apoptosis. The extracts from Halomonas meridiana (P3-37B), Chromohalobacter israelensis (K18) and Idiomarina loihiensis (P3-37C) were unable to induce any change in MMP in HeLa cancer cells, and thus suggested mitochondria-independent apoptosis induction. However, further detection of a PARP-1 cleavage product, and the observed changes in caspase-8 and -9 suggested the involvement of caspase-mediated apoptotic pathways.
Altogether, the study offers novel findings regarding the anticancer potential of several halophilic bacterial species inhabiting the Red Sea (at the depth of 1500–2500 m), which constitute valuable candidates for further isolation and characterization of bioactive molecules.
PMCID: PMC4235048  PMID: 24305113
Natural products; Cancer; Apoptosis; Marine bacteria; Deep-sea brine pools
16.  Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text 
PLoS ONE  2013;8(10):e77848.
In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually.
We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text.
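For readers unfamiliar with the document-term matrix mentioned above, a minimal sketch follows. It covers only that standard component (rows are documents, columns are vocabulary terms, cells are term counts); it does not attempt to reproduce the authors' PWM-based text representation. The tokenizer, function names, and example sentences are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_term_matrix(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    token_lists = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[column[term]] = count
        matrix.append(row)
    return vocab, matrix


# Placeholder sentences; GENE1, GENE2 and "disease X" are not real entities.
docs = [
    "GENE1 is methylated in disease X",
    "GENE2 is not methylated",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row can then serve as a feature vector for a classifier that decides whether a sentence reports a gene-disease methylation association.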
The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at The data is available for online browsing and download.
PMCID: PMC3797705  PMID: 24147091
17.  Exploration of miRNA families for hypotheses generation 
Scientific Reports  2013;3:2940.
Technological improvements have resulted in increased discovery of new microRNAs (miRNAs) and refinement and enrichment of existing miRNA families. miRNA families are important because they suggest a common sequence or structure configuration in sets of genes that hint at a shared function. Exploratory tools to enhance investigation of characteristics of miRNA families and the functions of family-specific miRNA genes are lacking. We have developed miRNAVISA, a user-friendly web-based tool that allows customized interrogation and comparisons of miRNA families for hypotheses generation, and comparison of per-species chromosomal distribution of miRNA genes in different families. This study illustrates hypothesis generation using miRNAVISA in seven species. Our results unveil a subclass of miRNAs that may be regulated by genomic imprinting, and also suggest that some miRNA families may be species-specific, as well as chromosome- and/or strand-specific.
PMCID: PMC3796740  PMID: 24126940
18.  Comparing Memory-Efficient Genome Assemblers on Stand-Alone and Cloud Infrastructures 
PLoS ONE  2013;8(9):e75505.
A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.
PMCID: PMC3785575  PMID: 24086547
19.  HMCan: a method for detecting chromatin modifications in cancer samples using ChIP-seq data 
Bioinformatics  2013;29(23):2979-2986.
Motivation: Cancer cells are often characterized by epigenetic changes, which include aberrant histone modifications. In particular, local or regional epigenetic silencing is a common mechanism in cancer for silencing expression of tumor suppressor genes. Though several tools have been created to enable detection of histone marks in ChIP-seq data from normal samples, it is unclear whether these tools can be efficiently applied to ChIP-seq data generated from cancer samples. Indeed, cancer genomes are often characterized by frequent copy number alterations: gains and losses of large regions of chromosomal material. Copy number alterations may create a substantial statistical bias in the evaluation of histone mark signal enrichment and result in underdetection of the signal in the regions of loss and overdetection of the signal in the regions of gain.
Results: We present HMCan (Histone modifications in cancer), a tool specially designed to analyze histone modification ChIP-seq data produced from cancer genomes. HMCan corrects for the GC-content and copy number bias and then applies Hidden Markov Models to detect the signal from the corrected data. On simulated data, HMCan outperformed several commonly used tools developed to analyze histone modification data produced from genomes without copy number alterations. HMCan also showed superior results on a ChIP-seq dataset generated for the repressive histone mark H3K27me3 in a bladder cancer cell line. HMCan predictions matched well with experimental data (qPCR validated regions) and included, for example, the previously detected H3K27me3 mark in the promoter of the DLEC1 gene, missed by other tools we tested.
Availability: Source code and binaries, implemented in C++, are available for download.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3834794  PMID: 24021381
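The copy number bias described in the HMCan abstract above can be illustrated with a minimal sketch. This is not HMCan's implementation (which is in C++ and also corrects for GC content before applying a Hidden Markov Model); it only assumes that reads have already been counted in fixed-size bins and that an integer copy number is known for each bin. The function name and the `ploidy` parameter are hypothetical.

```python
def normalize_by_copy_number(counts, copy_numbers, ploidy=2):
    """Rescale per-bin ChIP-seq read counts to the reference ploidy.

    A bin with 4 copies yields roughly twice the reads of a diploid bin
    with the same histone mark density, so its count is halved; a bin
    with 1 copy is doubled. This is an illustrative assumption, not the
    published HMCan procedure.
    """
    if len(counts) != len(copy_numbers):
        raise ValueError("counts and copy_numbers must have the same length")
    normalized = []
    for count, cn in zip(counts, copy_numbers):
        if cn <= 0:
            # Homozygous deletion: no DNA in this bin, so report no signal.
            normalized.append(0.0)
        else:
            normalized.append(count * ploidy / cn)
    return normalized


if __name__ == "__main__":
    # Toy bins: diploid, amplified (4 copies), single-copy loss, deleted.
    print(normalize_by_copy_number([10, 20, 5, 7], [2, 4, 1, 0]))
```

Without such a rescaling, the amplified bin in the toy input would look twice as enriched as the diploid bin, which is the overdetection in regions of gain that the abstract describes.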
20.  Poly(A) motif prediction using spectral latent features from human DNA sequences 
Bioinformatics  2013;29(13):i316-i325.
Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA.
Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
Results: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance.
We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. Meanwhile, our method makes ∼30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3694652  PMID: 23813000
21.  Information Exploration System for Sickle Cell Disease and Repurposing of Hydroxyfasudil 
PLoS ONE  2013;8(6):e65190.
Sickle cell disease (SCD) is a fatal monogenic disorder with no effective cure and thus high rates of morbidity and sequelae. Efforts toward discovery of disease modifying drugs and curative strategies can be augmented by leveraging the plethora of information contained in available biomedical literature. To facilitate research in this direction we have developed a resource, Dragon Exploration System for Sickle Cell Disease (DESSCD), which aims to promote the easy exploration of SCD-related data.
The Dragon Exploration System (DES), developed based on text mining and complemented by data mining, processed 419,612 MEDLINE abstracts retrieved from a PubMed query using SCD-related keywords. The processed SCD-related data has been made available via the DESSCD web query interface that enables: (a) information retrieval using specified concepts, keywords and phrases, and (b) the generation of inferred association networks and hypotheses. The usefulness of the system is demonstrated by: (a) reproducing a known scientific fact, the “Sickle_Cell_Anemia–Hydroxyurea” association, and (b) generating a novel and plausible “Sickle_Cell_Anemia–Hydroxyfasudil” hypothesis. A PCT patent (PCT/US12/55042) has been filed for the latter drug repurposing for SCD treatment.
We developed the DESSCD resource dedicated to exploration of text-mined and data-mined information about SCD. No similar SCD-related resource exists. Thus, we anticipate that DESSCD will serve as a valuable tool for physicians and researchers interested in SCD.
PMCID: PMC3677893  PMID: 23762313
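The idea of inferring association hypotheses from literature, mentioned in the DESSCD abstract above, can be sketched with a simple co-occurrence approach. The abstract does not say how DES infers associations, so the following is an assumption: a generic "A is linked to B, B is linked to C, A and C never co-occur" search over concepts tagged per abstract. All function names and the toy concept labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_pairs(documents):
    """Count, for each unordered concept pair, the documents mentioning both."""
    counts = defaultdict(int)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def candidate_hypotheses(documents, source):
    """Return concepts never seen with `source` but sharing a bridging concept.

    The result maps each candidate concept to the sorted list of bridging
    concepts that co-occur with both the source and the candidate.
    """
    neighbors = defaultdict(set)
    for a, b in cooccurrence_pairs(documents):
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = neighbors[source]
    candidates = defaultdict(set)
    for bridge in direct:
        for target in neighbors[bridge]:
            if target != source and target not in direct:
                candidates[target].add(bridge)
    return {target: sorted(bridges) for target, bridges in candidates.items()}


if __name__ == "__main__":
    # Toy input: each inner list holds the concepts tagged in one abstract.
    docs = [
        ["disease_X", "pathway_P"],
        ["pathway_P", "drug_D"],
        ["disease_X", "drug_E"],
    ]
    print(candidate_hypotheses(docs, "disease_X"))
```

In the toy input, `drug_D` never appears with `disease_X` but shares `pathway_P` with it, so it is proposed as a candidate; `drug_E` is a known direct association and is not. A real system would also need concept normalization and some ranking of candidates.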
22.  On the classification of long non-coding RNAs 
RNA Biology  2013;10(6):924-933.
Long non-coding RNAs (lncRNAs) have been found to perform various functions in a wide variety of important biological processes. To ease interpretation of lncRNA functionality and to enable deep mining of these transcribed sequences, it is convenient to classify lncRNAs into different groups. Here, we summarize classification methods of lncRNAs according to their four major features, namely, genomic location and context, effect exerted on DNA sequences, mechanism of functioning and their targeting mechanism. In combination with the presently available function annotations, we explore potential relationships between different classification categories, and generalize and compare biological features of different lncRNAs within each category. Finally, we present our view on potential further studies. We believe that the classifications of lncRNAs as indicated above are of fundamental importance for lncRNA studies, helpful for further investigation of specific lncRNAs, for formulation of new hypotheses based on different features of lncRNA and for exploration of the underlying lncRNA functional mechanisms.
PMCID: PMC4111732  PMID: 23696037
long non-coding RNA; lncRNA; lncRNA classification; RNA transcripts
23.  Dragon exploration system on marine sponge compounds interactions 
Natural products are considered a rich source of new chemical structures that may lead to therapeutic agents in all major disease areas. About 50% of the drugs introduced in the market in the last 20 years were natural products/derivatives or natural product mimics, which clearly shows the influence of natural products in drug discovery.
In an effort to further support the research in this field, we have developed an integrative knowledge base on Marine Sponge Compounds Interactions (Dragon Exploration System on Marine Sponge Compounds Interactions - DESMSCI) as a web resource. This knowledge base provides information about the associations of the sponge compounds with different biological concepts such as human genes or proteins, diseases, as well as pathways, based on the literature information available in PubMed and information deposited in several other databases. As such, DESMSCI is intended as a research support resource for problems concerning the utilization of marine sponge compounds. DESMSCI allows visualization of relationships between different chemical compounds and biological concepts through textual and tabular views, graphs and relational networks. In addition, DESMSCI has a built-in hypothesis discovery module that generates potentially new/interesting associations among different biomedical concepts. We also present a case study derived from the hypotheses generated by DESMSCI which provides a possible novel mode of action for variolins in Alzheimer’s disease.
DESMSCI is the first publicly available comprehensive resource where users can explore information, compiled by text- and data-mining approaches, on biological and chemical data related to sponge compounds.
PMCID: PMC3608955  PMID: 23415072
Sponge compounds interactions; Natural products; Text-mining; Information integration; Knowledge base
24.  Cytotoxic and apoptotic evaluations of marine bacteria isolated from brine-seawater interface of the Red Sea 
High salinity and temperature, combined with the presence of heavy metals and low oxygen, render the deep-sea anoxic brines of the Red Sea one of the most extreme environments on Earth. The ability to adapt and survive in these extreme environments makes inhabiting bacteria interesting candidates for the search of novel bioactive molecules.
A total of 20 extracts, lipophilic (chloroform) and hydrophilic (70% ethanol), of marine bacteria isolated from the brine-seawater interface of the Red Sea were tested for cytotoxic and apoptotic activity against three human cancer cell lines, i.e. HeLa (cervical carcinoma), MCF-7 (breast adenocarcinoma) and DU145 (prostate carcinoma).
Among these, twelve extracts were found to be very active after 24 hours of treatment and were further evaluated for their cytotoxic and apoptotic effects at 48 hours. The extracts from the isolates P1-37B and P3-37A (Halomonas) and P1-17B (Sulfitobacter) were found to be the most potent against the tested cancer cell lines.
Overall, bacterial isolates from the Red Sea displayed promising results and can be explored further to find novel drug-like molecules. The cell line specific activity of the extracts may be attributed to the presence of compounds of different polarity or to the cancer type, i.e. biological differences between cell lines and the different mechanisms of programmed cell death prevalent in different cancer cell lines.
PMCID: PMC3598566  PMID: 23388148
Marine bacteria; Deep sea brine pools; Extracts; Cytotoxicity; Apoptosis
25.  HOCOMOCO: a comprehensive collection of human transcription factor binding sites models 
Nucleic Acids Research  2012;41(Database issue):D195-D202.
Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs), obtained from different experimental data. Having a single TFBS model for a TF is more pragmatic for practical applications. We show that integration of TFBS data from various types of experiments into a single model typically results in improved model quality, probably due to partial correction of source-specific technique bias.
We present the Homo sapiens comprehensive model collection (HOCOMOCO), containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.
PMCID: PMC3531053  PMID: 23175603
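For readers unfamiliar with position weight matrices, the representation named in the HOCOMOCO abstract above can be sketched as follows. This is a textbook log-odds construction from already aligned binding sites, not ChIPMunk's motif discovery procedure. The function names, the pseudocount of 1 and the uniform background of 0.25 are assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length, pre-aligned binding sites.

    Returns one dict per motif position, mapping each base to
    log2(smoothed frequency / background frequency).
    """
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        row = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            row[base] = math.log2(freq / background)
        pwm.append(row)
    return pwm


def score(pwm, sequence):
    """Sum the per-position log-odds weights for a candidate site."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must match the motif length")
    return sum(row[base] for row, base in zip(pwm, sequence))


if __name__ == "__main__":
    # Toy aligned sites; real models are built from curated binding data.
    sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
    pwm = position_weight_matrix(sites)
    print(score(pwm, "ACGT"), score(pwm, "TTTT"))
```

A sequence resembling the aligned sites receives a higher score than an unrelated one, which is how such a matrix is used to scan for candidate binding sites.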
