Halomonas salina strain CIFRI1 is an extremely salt-stress-tolerant bacterium isolated from the salt crystals of the east coast of India. Here we report the annotated 3.45-Mb draft genome sequence of strain CIFRI1, comprising 86 contigs with 3,139 protein-coding loci, including 62 RNA genes.
Schistosomiasis is a neglected tropical disease caused by the parasite Schistosoma mansoni and affects over 200 million people annually. With the recent emergence of drug resistance, there is an urgent need to discover novel therapeutic options to control the disease. The multifunctional protein thioredoxin glutathione reductase (TGR), an enzyme essential for the survival of the pathogen in the redox environment, has been actively explored as a potential drug target. The recent availability of small-molecule screening datasets against this target provides a unique opportunity to learn molecular properties and apply computational models for the discovery of activities in large molecular libraries. Such a prioritisation approach could reduce the cost of failures in lead discovery. A supervised learning approach was employed to develop a cost-sensitive classification model to evaluate the biological activity of the molecules. Random forest was identified as the best of the classifiers evaluated, with an accuracy of around 80 percent. An independent maximally occurring substructure analysis revealed 10 highly enriched scaffolds in the actives dataset, and docking of these scaffolds against TGR was also performed. We show that combining machine learning with other cheminformatics approaches, such as substructure comparison and molecular docking, is an efficient way to prioritise molecules from large molecular datasets.
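The idea behind cost-sensitive classification can be illustrated with a minimal sketch. The abstract does not report the cost matrix used, so the costs below (a missed active counted as five times worse than a false alarm) and the function names are assumptions chosen only for illustration. Given a classifier's predicted probability that a molecule is active, the molecule is labelled active whenever that choice has the lower expected cost.

```python
def cost_threshold(cost_fn=5.0, cost_fp=1.0):
    """Probability above which predicting 'active' is the cheaper decision.

    cost_fn: assumed cost of missing a true active (false negative).
    cost_fp: assumed cost of flagging an inactive molecule (false positive).
    """
    return cost_fp / (cost_fp + cost_fn)


def expected_cost_label(p_active, cost_fn=5.0, cost_fp=1.0):
    """Return 1 (active) if that label has the lower expected cost, else 0."""
    cost_if_predict_inactive = p_active * cost_fn        # risk of a missed active
    cost_if_predict_active = (1.0 - p_active) * cost_fp  # risk of a false alarm
    return 1 if cost_if_predict_active < cost_if_predict_inactive else 0


# Hypothetical predicted probabilities for three molecules.
probs = [0.05, 0.20, 0.60]
labels = [expected_cost_label(p) for p in probs]
print(cost_threshold(), labels)
```

With unequal costs the decision threshold moves away from 0.5, which is one common way to handle screening datasets in which actives are rare.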
Background. Traditional Chinese medicine encompasses a well-established alternative system of medicine based on a broad range of herbal formulations and is practiced extensively in the region for the treatment of a wide variety of diseases. In recent years, several reports have described in-depth studies of the biological activities, including anti-bacterial activities, of the molecular ingredients of traditional Chinese medicines. The availability of a well-curated dataset of molecular ingredients of traditional Chinese medicines, accurate in-silico cheminformatics models for mining data for antitubercular agents, and computational filters to prioritize molecules has prompted us to search for potential hits in these datasets.
Results. We used a consensus approach to predict molecules with potential antitubercular activities from a large dataset of molecular ingredients of traditional Chinese medicines available in the public domain. We then prioritized 160 molecules based on five computational filters (SMARTSfilter) so as to avoid potentially undesirable molecules, and examined them for permeability across the mycobacterial cell wall and for potential activities against non-replicating and drug-tolerant mycobacteria. In-depth literature surveys of the reported antitubercular activities of the molecular ingredients and their sources were also considered in support of the prioritization.
Conclusions. Our analysis suggests that datasets of molecular ingredients of traditional Chinese medicines offer a new opportunity to mine for potential biological activities. In this report, we suggest a proof-of-concept methodology to prioritize molecules for further experimental assays using a variety of computational tools. We also suggest that a subset of the prioritized molecules could be evaluated for tuberculosis treatment, given their additional effect against non-replicating tuberculosis as well as the hepato-protection offered by the source of these ingredients.
Tuberculosis; Traditional Chinese medicine; Cheminformatics; Virtual screening; Data-mining
Turner syndrome is a chromosomal abnormality characterized by the absence of whole or part of the X chromosome in females. This X aneuploidy condition is associated with a diverse set of clinical phenotypes such as gonadal dysfunction, short stature, osteoporosis and Type II diabetes mellitus, among others. These phenotypes differ in their severity and penetrance among the affected individuals. Haploinsufficiency for a few X-linked genes has been associated with some of these disease phenotypes. RNA sequencing can provide valuable insights into the molecular mechanisms of disease processes. In the current study, we have analysed the transcriptome profiles of human untransformed 45,X and 46,XX fibroblast cells and identified genes that are differentially expressed between these two karyotypes. Functional analysis revealed that these differentially expressed genes are associated with bone differentiation, glucose metabolism and gonadal development pathways. We also report differential expression of lincRNAs in X monosomic cells. Our observations provide a basis for evaluation of cellular and molecular mechanism(s) in the establishment of Turner syndrome phenotypes.
Indians undergoing socioeconomic and lifestyle transitions will be maximally affected by the epidemic of type 2 diabetes (T2D). We conducted a two-stage genome-wide association study of T2D in 12,535 Indians, a less explored but high-risk group. We identified a new type 2 diabetes–associated locus at 2q21, with the lead signal being rs6723108 (odds ratio 1.31; P = 3.32 × 10^−9). Imputation analysis refined the signal to rs998451 (odds ratio 1.56; P = 6.3 × 10^−12) within TMEM163, which encodes a probable vesicular transporter in nerve terminals. TMEM163 variants also showed association with decreased fasting plasma insulin and homeostatic model assessment of insulin resistance, indicating a plausible effect through impaired insulin secretion. The 2q21 region also harbors RAB3GAP1 and ACMSD, which are involved in neurologic disorders. Forty-nine of 56 previously reported signals showed consistency in direction with similar effect sizes in Indians and previous studies, and 25 of them were also associated (P < 0.05). Known loci and the newly identified 2q21 locus together explained 7.65% of the variance in the risk of T2D in Indians. Our study suggests that common susceptibility variants for T2D are largely the same across populations, but also reveals a population-specific locus and provides further insights into the genetic architecture and etiology of T2D.
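For readers unfamiliar with the odds ratios quoted above, the following is a minimal sketch of how an allelic odds ratio and an approximate 95% confidence interval (Woolf's log method) can be computed from a 2×2 table of allele counts. The counts are invented for illustration and are not taken from the study, which is likely to have used regression-based association tests rather than this simple calculation.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio and Woolf 95% CI from allele counts in cases and controls."""
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: risk/other alleles in cases, then in controls.
or_value, ci_low, ci_high = allelic_odds_ratio(300, 700, 200, 800)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An odds ratio above 1 whose interval excludes 1 indicates that the allele is more common among cases than controls.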
Zebrafish (Danio rerio) is a popular vertebrate model organism largely deployed using outbred laboratory animals. The nonisogenic nature of the zebrafish as a model system offers the opportunity to understand natural variations and their effect in modulating phenotype. In an effort to better characterize the range of natural variation in this model system and to complement the zebrafish reference genome project, the whole genome sequence of a wild zebrafish at 39-fold genome coverage was determined. Comparative analysis with the zebrafish reference genome revealed approximately 5.2 million single nucleotide variations and over 1.6 million insertion–deletion variations. This dataset thus represents a new catalog of genetic variations in the zebrafish genome. Further analysis revealed selective enrichment for variations in genes involved in immune function and response to the environment, suggesting genome-level adaptations to environmental niches. We also show that human disease gene orthologs in the sequenced wild zebrafish genome show a lower ratio of nonsynonymous to synonymous single nucleotide variations.
A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or are available only in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a ‘wiki’-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment on, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a ‘structured wiki’ or rather a ‘semantic wiki’. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search.
We describe here the draft genome sequence of Sporosarcina pasteurii, a urease-producing bacterium with potential applications in biocement production.
Mycobacterium tuberculosis, along with closely related species, commonly known as the M. tuberculosis complex (MTBC), causes tuberculosis in humans and other organisms. Tuberculosis is a disease with high morbidity and mortality, especially in the third world. The genetic variability between clinical isolates of MTBC has been poorly understood, although recent years have seen the re-sequencing of a large number of clinical isolates of MTBC from around the world. The availability of genomic data of multiple isolates in the public domain potentially offers a unique opportunity toward understanding the variome of the organism and the functional consequences of the variations. This has nevertheless been limited by the lack of systematic curation and analysis of data sets available in the public domain. In this report, we have re-analyzed re-sequencing data sets corresponding to >450 isolates of MTBC available in the public domain to create a comprehensive variome map of MTBC comprising >29 000 single nucleotide variations. Using a systematic computational pipeline, we have annotated potential functional variants and drug-resistance-associated variants from the variome. We have made this data set available as a searchable database. Apart from a user-friendly interface, the database also has a novel option to annotate variants from clinical re-sequencing data sets of MTBC. To the best of our knowledge, tbvar is the largest and most comprehensive genome variation resource for MTBC.
Long non-coding RNAs (lncRNAs) represent an assorted class of transcripts having little or no protein-coding capacity and have recently gained importance for their function as regulators of gene expression. Molecular studies on lncRNAs have uncovered multifaceted interactions with protein-coding genes. It has been suggested that lncRNAs are an additional layer of regulatory switches involved in gene regulation during development and disease. LncRNAs expressed in specific tissues or cell types during adult stages can have potential roles in the form, function, maintenance and repair of tissues and organs. We used RNA sequencing followed by computational analysis to identify tissue-restricted lncRNA transcript signatures from five different tissues of adult zebrafish. The present study reports 442 predicted lncRNA transcripts from adult zebrafish tissues, of which 419 were novel lncRNA transcripts. Of these, 77 lncRNAs show predominantly tissue-restricted expression across the five major tissues investigated. Adult zebrafish brain expressed the largest number of tissue-restricted lncRNA transcripts, followed by cardiovascular tissue. We also validated the tissue-restricted expression of a subset of lncRNAs using independent methods. Our data constitute a useful genomic resource towards understanding the expression of lncRNAs in various tissues in adult zebrafish. Our study is thus a starting point and opens a way towards discovering new molecular interactions of gene expression within the specific adult tissues in the context of maintenance of organ form and function.
We describe the genome sequencing and analysis of a clinical isolate of the multidrug-resistant Mycobacterium tuberculosis Uganda I genotype (OSDD515) from India.
We describe the genome sequencing and analysis of a multidrug-resistant (MDR) clinical isolate of Mycobacterium tuberculosis, strain OSDD105 from India, belonging to a novel spoligotype.
Leishmaniasis is a neglected tropical disease that affects approximately 12 million individuals worldwide and is caused by the parasite Leishmania. The current drugs used in the treatment of leishmaniasis are highly toxic, and drug-resistant strains have emerged widely, which necessitates the development of new therapeutic options. The available high-throughput screen data have made it possible to generate computational predictive models that can assess the active scaffolds in a chemical library, followed by assessment of their ADME/toxicity properties in biological trials.
In the present study, we have used publicly available high-throughput screen datasets of chemical moieties that have been adjudged to target the pyruvate kinase enzyme of L. mexicana (LmPK). A machine learning approach was used to create computational models capable of predicting the biological activity of novel antileishmanial compounds. Further, we evaluated the molecules using a substructure-based approach to identify the common substructures contributing to their activity.
We generated computational models based on machine learning methods and evaluated the performance of these models based on various statistical figures of merit. The random forest based approach was found to be the most sensitive and also gave better accuracy and ROC performance. We further added a substructure-based approach to analyze the molecules and identify potentially enriched substructures in the active dataset. We believe that the models developed in the present study could help reduce the cost and length of clinical studies, so that newer drugs would reach the market faster and provide better healthcare options to patients.
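The statistical figures of merit mentioned above (sensitivity, accuracy and ROC) can be sketched in a few lines. This is an illustrative calculation on made-up labels and scores, not the evaluation code or data of the study; the 0.5 cutoff and the helper names are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve as the fraction of (active, inactive) pairs
    in which the active molecule receives the higher score (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical activity labels and classifier scores for eight molecules.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Reporting sensitivity alongside accuracy matters for screening data, where inactives usually dominate and a model can score high accuracy while missing most actives.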
We describe the genome sequencing and analysis of a clinical isolate of Mycobacterium tuberculosis belonging to the Ural strain OSDD493 from India.
With higher throughput and lower sequencing cost, second-generation sequencing technology has immense potential for translation into clinical practice and for the realization of pharmacogenomics-based patient care. The systematic analysis of whole genome sequences to assess patient-to-patient variability in pharmacokinetic and pharmacodynamic responses to drugs would be the next step in future medicine, in line with the vision of personalizing medicine.
Genomic DNA obtained from a 55-year-old, self-declared healthy, anonymous male of Malay descent was sequenced. The subject's mother died of lung cancer, and the father had a history of schizophrenia and died at the age of 65 years. A systematic, intuitive computational workflow/pipeline integrating a custom algorithm in tandem with large datasets of variant annotations and gene functions was developed to identify genetic variations with pharmacogenomic impact. A comprehensive pathway map of drug transport, metabolism and action was used as a template to map non-synonymous variations with potential functional consequences.
Over 3 million known variations and 100,898 novel variations in the Malay genome were identified. Further in-depth pharmacogenetics analysis revealed a total of 607 unique variants in 563 proteins, with the eventual identification of 4 drug transport genes, 2 drug metabolizing enzyme genes and 33 target genes harboring deleterious SNVs involved in pharmacological pathways, which could have a potential role in clinical settings.
The current study demonstrates the potential of personal genome sequencing for understanding functionally relevant variations with potential influence on drug transport, metabolism and differential therapeutic outcomes. Such variations will be essential for realizing personalized medicine through the use of a comprehensive computational pipeline for systematic data mining and analysis.
We describe the genome sequencing and analysis of a clinical isolate of Mycobacterium tuberculosis East African Indian (EAI) strain OSDD271 from India.
The advent of high-throughput genome-scale technologies has enabled us to unravel a large number of previously unknown transcriptionally active regions of the genome. Recent genome-wide studies have provided annotations of a large repertoire of various classes of noncoding transcripts. Long noncoding RNAs (lncRNAs) form a major proportion of these novel annotated noncoding transcripts and are presently known to be involved in a number of functionally distinct biological processes. Over 18 000 transcripts are presently annotated as lncRNA, and encompass previously annotated classes of noncoding transcripts including large intergenic noncoding RNA, antisense RNA and processed pseudogenes. There is a significant gap in the resources providing a stable annotation, cross-referencing and biologically relevant information. lncRNome has been envisioned with the aim of filling this gap by integrating annotations on a wide variety of biologically significant information into a comprehensive knowledgebase. To the best of our knowledge, lncRNome is one of the largest and most comprehensive resources for lncRNAs.
Human mitochondrial DNA (mtDNA) encodes a set of 37 genes which are essential structural and functional components of the electron transport chain. Variations in these genes have been implicated in a broad spectrum of diseases and are extensively reported in the literature and various databases. In this study, we describe MitoLSDB, an integrated platform to catalogue disease association studies on mtDNA (http://mitolsdb.igib.res.in). The main goal of MitoLSDB is to provide a central platform for direct submissions of novel variants that can be curated by the Mitochondrial Research Community. MitoLSDB provides access to standardized and annotated data from the literature and databases, encompassing information from 5231 individuals, 675 populations and 27 phenotypes. This platform is developed using the Leiden Open (source) Variation Database (LOVD) software. MitoLSDB houses information on all 37 genes in each population, amounting to 132397 variants, of which 5147 are unique. For each variant, its genomic location as per the Revised Cambridge Reference Sequence, codon and amino acid change for variations in protein-coding regions, frequency, disease/phenotype, population, reference and remarks are also listed. MitoLSDB curators have also reported errors documented in the literature, which include 94 phantom mutations, 10 NUMTs, six documentation errors and one artefactual recombination. MitoLSDB is the largest repository of mtDNA variants systematically standardized and presented using the LOVD platform. We believe that this is a good starting resource to curate mtDNA variants and will facilitate direct submissions, enhancing data coverage, annotation in the context of pathogenesis and quality control by ensuring non-redundancy in reporting novel disease-associated variants.
Malaria is a major healthcare problem worldwide resulting in an estimated 0.65 million deaths every year. It is caused by the members of the parasite genus Plasmodium. The current therapeutic options for malaria are limited to a few classes of molecules, and are fast shrinking due to the emergence of widespread resistance to drugs in the pathogen. The recent availability of high-throughput phenotypic screen datasets for antimalarial activity offers a possibility to create computational models for bioactivity based on chemical descriptors of molecules with potential to accelerate drug discovery for malaria.
In the present study, we have used high-throughput screen datasets for the discovery of apicoplast inhibitors of the malarial pathogen as assayed from the delayed death response. We employed a machine learning approach and developed computational predictive models to predict the biological activity of new antimalarial compounds. The molecules were further evaluated for common substructures using a Maximum Common Substructure (MCS) based approach.
We created computational models using state-of-the-art machine learning algorithms. The models were evaluated based on multiple statistical criteria. We found that the Random Forest based approach provides better accuracy, as assessed from ROC curve analysis. We further evaluated the active molecules using a substructure-based approach to identify common substructures enriched in the active set. We argue that the computational models generated could be effectively used to screen large molecular datasets to prioritize them for phenotypic screens, drastically reducing cost while improving the hit rate.
Long noncoding RNAs (lncRNAs) are a recently discovered class of non-protein-coding RNAs, which have increasingly been shown to be involved in a wide variety of biological processes as regulatory molecules. The functional role of many of the members of this class has been an enigma, except for a few such as Malat and HOTAIR. Little is known regarding the regulatory interactions between noncoding RNA classes. Recent reports have suggested that lncRNAs could potentially interact with other classes of non-coding RNAs, including microRNAs (miRNAs), and modulate their regulatory role through interactions. We hypothesized that lncRNAs could participate as a layer of regulatory interactions with miRNAs. The availability of genome-scale datasets for Argonaute targets across the human transcriptome has prompted us to reconstruct a genome-scale network of interactions between miRNAs and lncRNAs.
We used well-characterized experimental Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation (PAR-CLIP) datasets and the recent genome-wide annotations for lncRNAs in the public domain to construct a comprehensive transcriptome-wide map of miRNA regulatory elements. Comparative analysis revealed that in addition to targeting protein-coding transcripts, miRNAs could also potentially target lncRNAs, thus participating in a novel layer of regulatory interactions between noncoding RNA classes. Furthermore, we have modeled one example of miRNA-lncRNA interaction using a zebrafish model. We have also found that the miRNA regulatory elements have a positional preference, clustering towards the mid regions and 3′ ends of the long noncoding transcripts. We further reconstructed a genome-wide map of miRNA interactions with lncRNAs as well as messenger RNAs.
This analysis suggests widespread regulatory interactions between noncoding RNA classes and suggests a novel functional role for lncRNAs. We also present the first transcriptome-scale study on miRNA-lncRNA interactions and the first report of a genome-scale reconstruction of a noncoding RNA regulatory interactome involving lncRNAs.
Long non-coding RNAs have emerged as an increasingly well-studied subset of non-coding RNAs (ncRNAs) following their recent discovery in a number of organisms, including humans, and the characterization of their functional and regulatory roles in a variety of distinct cellular mechanisms. The recent annotations of long ncRNAs in humans peg their numbers as similar to those of protein-coding genes. However, despite the rapid advancements in the field, the functional characterization and biological roles of most long ncRNAs remain unidentified, although some candidate long ncRNAs have been extensively studied for their roles in cancers and in biological phenomena such as X-inactivation and epigenetic regulation of genes. A number of recent reports suggest an exciting possibility of long ncRNAs mediating host response and immune function, suggesting an elaborate network of regulatory interactions mediated through ncRNAs in infection. The present understanding of the role of long ncRNAs in host-pathogen cross talk is limited to a handful of mechanistically distinct examples. The current commentary chronicles the findings of these reports on the role of long ncRNAs in infection biology and further highlights the bottlenecks and future directions toward understanding the biological significance of long ncRNAs in infection biology.
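The positional preference of miRNA regulatory elements reported in this abstract can be examined with a simple binning of site positions along each transcript. The sketch below is an assumption about how such a summary might be computed, not the authors' pipeline; the site coordinates are invented, and positions are taken as 0-based offsets from the 5′ end.

```python
def relative_position_histogram(sites, n_bins=3):
    """Count sites per relative-position bin along their transcripts.

    sites: iterable of (site_midpoint, transcript_length) pairs, with the
           midpoint measured in nucleotides from the 5' end.
    Returns a list of counts ordered from the 5' end to the 3' end.
    """
    counts = [0] * n_bins
    for midpoint, length in sites:
        fraction = midpoint / length  # 0.0 at the 5' end, close to 1.0 at the 3' end
        index = min(int(fraction * n_bins), n_bins - 1)  # clamp the last position
        counts[index] += 1
    return counts


# Hypothetical miRNA sites on lncRNA transcripts: (midpoint, transcript length).
example_sites = [
    (100, 1000), (450, 1000), (900, 1000),
    (1500, 2000), (1999, 2000), (600, 1200),
]
counts = relative_position_histogram(example_sites)
print(dict(zip(["5' third", "middle third", "3' third"], counts)))
```

Normalizing by transcript length lets transcripts of different sizes be pooled; a real analysis would also compare the counts against a background expectation before claiming enrichment.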
long non-coding RNA; infection; pathogen; immune; host-pathogen interactions