Nonalcoholic fatty liver disease (NAFLD) is a common liver disorder that currently lacks effective treatment. Berberine (BBR), a botanical compound isolated from traditional Chinese medicine, shows strong therapeutic potential for this metabolic disease. The current study aimed to understand the mechanisms underlying the therapeutic effect of BBR in NAFLD.
We performed systematic analyses of hepatic expression profiles of mRNAs and long noncoding RNAs (lncRNAs) in a high-fat diet (HFD)-induced steatotic animal model with or without BBR treatment. The analyses used bioinformatic methods, including hierarchical clustering, gene enrichment and gene co-expression network analysis. The effect of BBR on the expression of selected genes was confirmed by quantitative RT-PCR and further studied in a human hepatic cell line, Huh7.
We found a large group of genes, including 881 mRNAs and 538 lncRNAs, whose expression in the steatotic liver was reversed by BBR treatment, suggesting a global effect of BBR in modulating hepatic gene expression profiles. Among the BBR-regulated genes, we identified several modules and numerous significant genes that were associated with liver metabolism and NAFLD-related functions. Specifically, a conserved lncRNA, MRAK052686, was found to be strongly correlated with the antioxidant factor Nrf2, and both genes were down-regulated in the steatotic liver. Moreover, the reduced expression of MRAK052686 and Nrf2 was completely reversed by BBR treatment, suggesting a new mechanism accounting for the therapeutic effect of BBR.
These findings provide, for the first time, genetic insight into the pharmacological mechanism by which BBR protects against NAFLD.
Electronic supplementary material
The online version of this article (doi:10.1186/s12967-015-0383-6) contains supplementary material, which is available to authorized users.
Berberine; NAFLD; lncRNAs; MRAK052686; Nrf2
Gene regulatory networks (GRNs) coherently coordinate the expression of genes and control the behavior of cellular systems. The difficulty in modeling a quantitative GRN usually results from inaccurate parameter estimation, which is mostly due to small sample sizes. For better modeling of GRNs, we have designed a small-sample iterative optimization algorithm (SSIO) to quantitatively model GRNs with nonlinear regulatory relationships. The algorithm uses gene expression data as the primary input and can be applied when sample sizes are small. Using SSIO, we have quantitatively constructed dynamic models for the GRNs controlling human and mouse adipogenesis. Compared with two other commonly used methods, SSIO shows better performance with relatively lower residual errors, and it generates rational predictions of adipocyte responses to external signals and of steady states. Sensitivity analysis further supports the validity of our method. Several differences are observed between the GRNs of human and mouse adipocyte differentiation, suggesting differences in the regulatory efficiencies of the transcription factors between the two species. In addition, we use SSIO to quantitatively determine the strengths of the regulatory interactions as well as to optimize regulatory models. The results indicate that SSIO facilitates better investigation and understanding of gene regulatory processes.
In eukaryotic genomes, about 10% of genes are arranged in a head-to-head (H2H) orientation, in which the distance between the transcription start sites of the two genes in a pair is less than 1 kb. Two genes in an H2H pair tend to be co-expressed and to share functions. There have been many studies on bidirectional promoters. However, the mechanism by which H2H genes are regulated at the transcriptional level still needs further clarification, especially with regard to the co-regulation of H2H pairs. In this study, we first used Hi-C data on chromatin linkages to identify spatially interacting H2H pairs, and then integrated ChIP-seq data to compare H2H gene pairs with and without evidence of spatial interactions in terms of their binding transcription factors (TFs). Using ChIP-seq and DNase-seq data, histones and DNase associated with H2H pairs were identified. Furthermore, we looked into the connections between H2H genes in a human co-expression network.
We found that i) similar to the two genes within an H2H pair (intra-H2H pair), a gene pair drawn from two distinct H2H pairs that interact with each other spatially (inter-H2H pair) shares common transcription factors (TFs); ii) TFs of intra- and inter-H2H pairs are distributed differently. Factors such as HEY1, GABP, Sin3Ak-20, POL2, E2F6, and c-MYC are essential for the bidirectional transcription of intra-H2H pairs, while factors such as CTCF, BDP1, GATA2, RAD21, and POL3 play important roles in coherently regulating inter-H2H pairs; iii) H2H gene blocks are enriched with hypersensitive DNase and modified histones, which participate in active transcription; and iv) H2H genes tend to be more highly connected than non-H2H genes in the human co-expression network.
Our findings shed new light on the mechanism of the transcriptional regulation of H2H genes through their linear and spatial interactions. For intra-H2H gene pairs, transcription factors regulate transcription through bidirectional promoters, whereas for inter-H2H gene pairs, transcription factors are likely to regulate activity depending on the spatial interaction of the H2H gene pairs. In this way, two distinct groups of transcription factors mediate intra- and inter-H2H gene transcription, respectively, resulting in a highly compact gene regulatory network.
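The H2H definition used above (two divergently oriented genes whose transcription start sites lie less than 1 kb apart) can be illustrated with a minimal Python sketch. The `Gene` record, the `find_h2h_pairs` function and the toy coordinates are hypothetical and are not taken from the study; the sketch only checks neighbouring genes on the same chromosome and assumes that the minus-strand gene lies upstream of the plus-strand gene.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (not the study's data model)."""
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate


def find_h2h_pairs(genes, max_dist=1000):
    """Return (minus_gene, plus_gene, distance) for neighbouring divergent
    genes whose TSSs are less than max_dist bp apart."""
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.tss)
        # Only adjacent genes (ordered by TSS) are compared in this sketch.
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            # Head-to-head: left gene on the minus strand, right gene on the plus strand.
            if left.strand == "-" and right.strand == "+":
                dist = right.tss - left.tss
                if dist < max_dist:
                    pairs.append((left.name, right.name, dist))
    return pairs


# Toy coordinates, invented for illustration only.
toy_genes = [
    Gene("A", "chr1", "-", 1000),
    Gene("B", "chr1", "+", 1500),
    Gene("C", "chr1", "+", 5000),
    Gene("D", "chr2", "-", 100),
    Gene("E", "chr2", "-", 300),
]
h2h = find_h2h_pairs(toy_genes)
```

With the toy input, only genes A and B qualify: they are divergent and their start sites are 500 bp apart, whereas D and E lie close together but are on the same strand.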
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-519) contains supplementary material, which is available to authorized users.
Transcriptional regulatory networks (TRNs) are used to study conditional regulatory relationships between transcription factors and genes. However, few studies have tried to integrate genomic variation information such as copy number variation (CNV) with a TRN to find causal disturbances in a network. Intrahepatic cholangiocarcinoma (ICC) is the second most common hepatic carcinoma, with high malignancy and poor prognosis. Research on ICC is relatively limited compared with hepatocellular carcinoma, and there are no approved gene therapeutic targets yet.
We first constructed a TRN of ICC (ICC-TRN) using a combined forward-and-reverse engineering method, and then integrated copy number variation information with the ICC-TRN to select CNV-related modules and construct a CNV-ICC-TRN. We also integrated the CNV-ICC-TRN with KEGG signaling pathways to investigate how CNV genes disturb signaling pathways. Finally, an unsupervised clustering method was applied to classify samples into distinct classes.
We obtained a CNV-ICC-TRN containing 33 modules, which were enriched in ICC-related signaling pathways. Integrated analysis of the regulatory network and signaling pathways illustrated that CNV might interrupt signaling by affecting the genomic loci of either the nodes of a signaling pathway or the regulators of those nodes. Finally, expression profiles of nodes in the CNV-ICC-TRN were used to cluster the ICC patients into two robust groups with distinct biological function features.
Our work represents an initial effort to construct a TRN in ICC, and also an initial effort to identify key transcriptional modules based on their involvement in genetic variation, as shown by gene copy number variations (CNVs). This kind of approach may take traditional TRN studies, which are based only on expression data, one step further towards genetic disturbance. It can easily be extended to other disease samples with appropriate data.
Viral infections result in millions of deaths in the world today. A thorough analysis of virus-host interactomes may reveal insights into viral infection and pathogenic strategies. In this study, we present a landscape of virus-host interactomes based on protein domain interactions. Compared to analysis at the protein level, this domain-domain interactome provides a unique abstraction of the protein-protein interactome. Through comparisons among DNA, RNA, and retrotranscribing viruses, we identified a core of human domains that viruses use to hijack the cellular machinery and evade the immune system, and which might be promising antiviral drug targets. We showed that viruses preferentially interact with host hub and bottleneck domains, and that degree and betweenness centrality differ significantly among the three categories of viruses. Further analysis at the functional level highlighted that different viruses perturb the host cellular molecular network by common and unique strategies. Most importantly, we proposed a viral disease network among viral domains, human domains and the corresponding diseases, which uncovered several unknown virus-disease relationships that need further verification. Overall, we expect that the findings will deepen the understanding of viral infection and contribute to the development of antiviral therapy.
Analysis of the biological pathways involved in complex human diseases is an important step in elucidating the pathogenesis and mechanism of diseases. Most pathway analysis approaches identify disease-related biological pathways using overlapping genes between pathways and diseases. However, these approaches ignore the functional biological association between pathways and diseases. In this paper, we designed a novel computational framework for prioritising disease-risk pathways based on functional profiling. The disease gene set and biological pathways were translated into functional profiles in the context of GO annotations. We then implemented a semantic similarity measurement for calculating the concordance score between a functional profile of disease genes and a functional profile of pathways (FPP); the concordance score was then used to prioritise and infer disease-risk pathways. A freely accessible web toolkit, ‘Functional Profiling-based Pathway Prioritisation’ (FPPP), was developed (http://bioinfo.hrbmu.edu.cn/FPPP). During validation, our method successfully identified known disease–pathway pairs with area under the ROC curve (AUC) values of 96.73% and 95.02% in tests using pathway randomisation and disease randomisation, respectively. A robustness analysis showed that FPPP is reliable even when using data containing noise. A case study based on a dilated cardiomyopathy data set indicated that the high-ranking pathways from FPPP are well known to be linked with this disease. Furthermore, we predicted the risk pathways of 413 diseases by using FPPP to build a disease similarity landscape that systematically reveals the global modular organisation of disease associations.
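The validation above is reported as area under the ROC curve (AUC). As a minimal illustration of what that statistic measures, the Python sketch below computes AUC as the probability that a randomly chosen true disease–pathway pair receives a higher score than a randomly chosen randomised pair. The function name `auc` and the example scores are hypothetical; this is not the FPPP implementation and does not reproduce its concordance score.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs in which the
    positive scores higher; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical concordance scores, for illustration only.
known_pairs = [0.9, 0.8, 0.4]
random_pairs = [0.7, 0.3]
example_auc = auc(known_pairs, random_pairs)
```

In this toy case five of the six positive-negative comparisons favour the known pair, so the AUC is 5/6. A value of 0.5 would correspond to scores that do not separate known pairs from randomised ones at all.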
complex human diseases; prioritising risk pathways; functional profiling; concordance score
Post-translational modifications (PTMs) of proteins play essential roles in almost all cellular processes, and are closely related to the physiological activity and disease development of living organisms. The development of tandem mass spectrometry (MS/MS) has resulted in a rapid increase in the number of PTMs identified on proteins from different species. The collection and systematic ordering of PTM data should provide invaluable information for understanding cellular processes and signaling pathways regulated by PTMs. For this purpose, in 2009 we developed SysPTM, a systematic resource containing comprehensive PTM data and a suite of web tools for PTM annotation. Four years later, PTM data generation has advanced significantly and, consequently, more sophisticated analysis requirements have to be met. Here we present an updated version, SysPTM 2.0 (http://lifecenter.sgst.cn/SysPTM/), with almost doubled data content and enhanced web-based analysis tools: PTMBlast, PTMPathway, PTMPhylog and PTMCluster. Moreover, a new section, SysPTM-H, has been constructed to graphically represent combinatorial histone PTMs and the dynamic regulation of histone modifying enzymes, and a new tool, PTMGO, has been added for functional annotation and enrichment analysis. SysPTM 2.0 not only facilitates rich annotation of PTM sites but also allows users to investigate PTM functions systematically.
Citation details: Li,J., Jia,J., Li,H. et al. SysPTM 2.0: an updated systematic resource for post-translational modification. Database (2014) Vol. 2014: article ID bau025; doi:10.1093/database/bau025.
Lysine acetylation is a crucial type of protein post-translational modification, which is involved in many important cellular processes and serious diseases. However, identification of protein acetylation sites through traditional experimental methods is time-consuming and laborious, and those methods are not suitable for identifying a large number of acetylated sites quickly. Therefore, computational methods remain very valuable for accelerating the discovery of lysine acetylation sites.
In this study, many biological characteristics of acetylated sites were investigated, such as the amino acid sequence around the acetylated sites, the physicochemical properties of the amino acids and the transition probability of adjacent amino acids. A logistic regression method was then used to integrate this information into a novel lysine acetylation prediction system named LAceP. LAceP outperforms most state-of-the-art methods. In particular, LAceP has a more balanced prediction capability for positive and negative datasets.
LAceP can integrate different biological features to predict lysine acetylation with high accuracy. An online web server is freely available at http://www.scbit.org/iPTM/.
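One of the feature types mentioned above is the amino acid sequence around a candidate site. A minimal Python sketch of how such a feature could be prepared for a logistic regression model is given below: it extracts a fixed-length window centred on each lysine and one-hot encodes it. The function names, the window size and the padding symbol `X` are assumptions made for illustration and are not taken from LAceP.

```python
# 20 standard amino acids plus "X" as an assumed padding symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def lysine_windows(seq, flank=3, pad="X"):
    """Return (1-based position, window) for every lysine (K) in seq.
    Windows have length 2 * flank + 1 and are padded at the sequence ends."""
    padded = pad * flank + seq + pad * flank
    return [
        (i + 1, padded[i:i + 2 * flank + 1])
        for i, aa in enumerate(seq)
        if aa == "K"
    ]


def one_hot(window):
    """Flatten a window into a 0/1 vector with one block of len(ALPHABET) per residue."""
    vector = []
    for aa in window:
        vector.extend(1 if aa == symbol else 0 for symbol in ALPHABET)
    return vector


# Toy sequence, invented for illustration only.
windows = lysine_windows("MKTAYKL", flank=2)
features = [one_hot(w) for _, w in windows]
```

Each feature vector could then be passed to any logistic regression implementation together with features of the other types described above; how LAceP combines or weights them is not specified here.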
Substantial progress has been made in the identification of type 2 diabetes (T2D) risk loci in the past few years, but our understanding of the genetic basis of T2D in ethnically diverse populations remains limited. We performed a genome-wide association study and a replication study in Han Chinese comprising 8,569 T2D case subjects and 8,923 control subjects in total, from which 10 single nucleotide polymorphisms were selected for further follow-up in a de novo replication sample of 3,410 T2D case and 3,412 control subjects and an in silico replication sample of 6,952 T2D case and 11,865 control subjects. Besides confirming seven established T2D loci (CDKAL1, CDKN2A/B, KCNQ1, CDC123, GLIS3, HNF1B, and DUSP9) at genome-wide significance, we identified two novel T2D loci, G-protein–coupled receptor kinase 5 (GRK5) (rs10886471: P = 7.1 × 10⁻⁹) and RASGRP1 (rs7403531: P = 3.9 × 10⁻⁹), of which the association signal at GRK5 seems to be specific to East Asians. In nondiabetic individuals, the T2D risk-increasing allele of RASGRP1-rs7403531 was also associated with higher HbA1c and lower homeostasis model assessment of β-cell function (P = 0.03 and 0.0209, respectively), whereas the T2D risk-increasing allele of GRK5-rs10886471 was also associated with higher fasting insulin (P = 0.0169) but not with fasting glucose. Our findings not only provide new insights into the pathophysiology of T2D, but may also shed light on the ethnic differences in T2D susceptibility.
Tandem mass spectrometry (MS/MS) technology has been applied to identify proteins, as an ultimate approach to confirm the original genome annotation. To be able to identify gene fusion proteins, a special database containing peptides that cross over gene fusion breakpoints is needed.
It is impractical to construct a database that includes all possible fusion peptides originating from potential breakpoints. Focusing on 6259 reported and predicted gene fusion pairs from ChimerDB 2.0 and the Cancer Gene Census, we created, for the first time, a database, CanProFu, that comprehensively annotates fusion peptides formed by exon-exon linkage between these paired genes.
Applying this database to mass spectrometry datasets of 40 human non-small cell lung cancer (NSCLC) samples and 39 normal lung samples with stringent search criteria, we were able to identify 19 unique fusion peptides characterizing gene fusion events. Among them, 11 gene fusion events were found only in NSCLC samples. In addition, 4 alternative splicing events were characterized in cancerous or normal lung samples.
The database and workflow in this work can be flexibly applied to other MS/MS based human cancer experiments to detect gene fusions as potential disease biomarkers or drug targets.
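The idea of a peptide that crosses a fusion breakpoint can be sketched in a few lines of Python. The example below joins two already translated, in-frame protein segments, performs a simple in silico tryptic digest (cleavage after K or R unless the next residue is P, with no missed cleavages) and keeps only peptides that contain residues from both partners. The function names, the minimum peptide length and the sequences are hypothetical; this is an illustration of the concept rather than the CanProFu construction procedure.

```python
def tryptic_peptides(protein):
    """Return (start, end, peptide) tuples for a full tryptic digest.
    Cleavage occurs after K or R unless followed by P; end is exclusive."""
    peptides = []
    start = 0
    last = len(protein) - 1
    for i, aa in enumerate(protein):
        cleave = aa in "KR" and i != last and protein[i + 1] != "P"
        if cleave or i == last:
            peptides.append((start, i + 1, protein[start:i + 1]))
            start = i + 1
    return peptides


def junction_peptides(upstream, downstream, min_len=6):
    """Return tryptic peptides of the fused sequence that span the junction,
    i.e. contain at least one residue from each fusion partner."""
    fusion = upstream + downstream
    junction = len(upstream)  # index of the first downstream residue
    return [
        pep
        for start, end, pep in tryptic_peptides(fusion)
        if start < junction < end and len(pep) >= min_len
    ]


# Toy protein segments, invented for illustration only.
spanning = junction_peptides("MAAKGGLE", "TVSDRPEPK")
```

In the toy case the digest yields `MAAK` and `GGLETVSDRPEPK`, and only the second crosses the junction. If the upstream segment ends in K or R (and the downstream segment does not start with P), a full digest cuts exactly at the junction and no spanning peptide is produced, which is one reason a practical search might also allow missed cleavages.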
The gene Polymorphic derived intron-containing, known as Pldi, is a long non-coding RNA (lncRNA) first discovered in mouse. Although parts of its sequence were reported to be conserved in rat and human, it is expressed only in mouse testis, from a mouse-specific transcription start site. The consensus sequence of Pldi is also part of an antisense transcript, AK158810, expressed in a wide range of mouse tissues.
We focused on the sequence origin of Pldi and Ak158810. We demonstrated that their sequence originated from an intergenic region and is present only in mammals. Transposition events and chromosome rearrangements were involved in the evolution of the ancestral sequence. Moreover, we discovered that high conservation in part of this region was correlated with chromosome rearrangements, CpG demethylation and transcription factor binding motifs. These results demonstrated that multiple factors contributed to the sequence origin of Pldi.
We comprehensively analyzed the sequence origin of the Pldi-Ak158810 loci. We showed that various factors, including rearrangements and transposable elements, contributed to the formation of the sequence.
Overlapping transcripts; Sequence Origin of Pldi and Ak158810 loci; Conserved Element; Substitution rate
Connections between inflammation and diseases are thought to be important in understanding the genetic mechanisms of diseases. However, studies on the functional cross-links between inflammation and disease genes are still in their early stages. We integrated protein–protein interactions (PPIs), inflammation genes, and gene–disease associations to construct a disease-inflammation network (DIN). We found that nodes that are both inflammation and disease genes (termed inter-genes) are topologically important in the DIN structure. By mapping inter-genes to the PPI network, we classified diseases into two categories, which differ significantly in Intimacy, a measure of the contribution of inflammation genes to the connections between disease pairs. Furthermore, we constructed a cross-talking subpathway network. The cross-subpathway analysis performs well in capturing higher-level relationships among inflammation and disease processes. Collectively, the network-based analysis provides promising insight into the intricate relationship between inflammation and disease genes.
Epitope-antibody-reactivities (EAR) of intravenous immunoglobulins (IVIGs) determined for 75,534 peptides by microarray analysis demonstrate that roughly 9% of peptides derived from 870 different human protein sequences react with antibodies present in IVIG. Computational prediction of linear B cell epitopes was conducted using machine learning with an ensemble of classifiers in combination with position weight matrix (PWM) analysis. Machine learning slightly outperformed PWM with area under the curve (AUC) of 0.884 vs. 0.849. Two different types of epitope-antibody recognition-modes (Type I EAR and Type II EAR) were found. Peptides of Type I EAR are high in tyrosine, tryptophan and phenylalanine, and low in asparagine, glutamine and glutamic acid residues, whereas for peptides of Type II EAR it is the other way around. Representative crystal structures present in the Protein Data Bank (PDB) of Type I EAR are PDB 1TZI and PDB 2DD8, while PDB 2FD6 and 2J4W are typical for Type II EAR. Type I EAR peptides share predicted propensities for being presented by MHC class I and class II complexes. The latter interaction possibly favors T cell-dependent antibody responses including IgG class switching. Peptides of Type II EAR are predicted not to be preferentially presented by MHC complexes, thus implying the involvement of T cell-independent IgG class switch mechanisms. The high extent of IgG immunoglobulin reactivity with human peptides implies that circulating IgG molecules are prone to bind to human protein/peptide structures under non-pathological, non-inflammatory conditions. A webserver for predicting EAR of peptide sequences is available at www.sysmed-immun.eu/EAR.
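Position weight matrix (PWM) analysis, mentioned above as one of the two prediction approaches, can be illustrated with a minimal Python sketch. It builds a log-odds PWM from a set of equal-length peptides, using a pseudocount and a uniform amino acid background, and scores a query peptide by summing the per-position log-odds. The function names, the pseudocount, the uniform background and the toy peptides are assumptions for illustration; they are not the settings used in the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(peptides, pseudocount=1.0):
    """Build a log2-odds PWM (list of dicts, one per position) from
    equal-length peptides, assuming a uniform background distribution."""
    length = len(peptides[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pwm.append(column)
    return pwm


def pwm_score(pwm, peptide):
    """Sum the log-odds of each residue at its position."""
    return sum(column[aa] for column, aa in zip(pwm, peptide))


# Toy "reactive" peptides rich in aromatic residues, invented for illustration.
pwm = build_pwm(["YWF", "YWF", "YWA"])
score_aromatic = pwm_score(pwm, "YWF")
score_polar = pwm_score(pwm, "NQE")
```

A peptide resembling the training set receives a positive score, while one composed of residues never seen at those positions receives a negative score. A threshold on this score, chosen on held-out data, would turn the PWM into a classifier whose AUC can be compared with that of a machine learning model.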
It is known that chromatin features such as histone modifications and the binding of transcription factors exert a significant impact on the “openness” of chromatin. In this study, we present a quantitative analysis of the genome-wide relationship between chromatin features and chromatin accessibility in DNase I hypersensitive sites. We found that these features show a distinct preference for localizing in open chromatin. To elucidate the exact impact, we derived quantitative models to directly predict the “openness” of chromatin using histone modification features and transcription factor binding features, respectively. We show that these two types of features are highly predictive of chromatin accessibility from a statistical viewpoint. Moreover, our results indicate that these features are highly redundant and that only a small number of features are needed to achieve very high predictive power. Our study provides new insights into the underlying biological phenomena and the combinatorial effects of chromatin features on differential DNase I hypersensitivity.
Schistosoma japonicum is a parasitic flatworm that causes human schistosomiasis, a significant cause of morbidity in China and the Philippines. Here we present a draft genomic sequence for the worm, which is the first reported for any flatworm, indeed for the superphylum Lophotrochozoa. The genome provides a global insight into the molecular architecture and host interaction of this complex metazoan pathogen, revealing that it can exploit host nutrients, neuroendocrine hormones and signaling pathways for growth, development and maturation. Having a complex nervous system and a well developed sensory system, S. japonicum can accept stimulation of the corresponding ligands as a physiological response to different environments, such as fresh water or the tissues of its intermediate and mammalian hosts. Numerous proteinases, including cercarial elastase, are implicated in mammalian skin penetration and haemoglobin degradation. The genomic information will serve as a valuable platform to facilitate development of new interventions for schistosomiasis control.
Metabolomics helps to identify links between environmental exposures and intermediate biomarkers of disturbed pathways. We previously reported variations in phosphatidylcholines in male smokers compared with non-smokers in a cross-sectional pilot study with a small sample size, but knowledge of the reversibility of smoking effects on metabolite profiles is limited. Here, we extend our metabolomics study with a large prospective study including female smokers and quitters.
Using a targeted metabolomics approach, we quantified 140 metabolite concentrations for 1,241 fasting serum samples in the population-based Cooperative Health Research in the Region of Augsburg (KORA) human cohort at two time points: a baseline survey conducted between 1999 and 2001 and a follow-up after seven years. Metabolite profiles were compared among groups of current smokers, former smokers and never smokers, and were further assessed for their reversibility after smoking cessation. Changes in metabolite concentrations from baseline to follow-up were investigated in a longitudinal analysis comparing current smokers, never smokers and smoking quitters, who were current smokers at baseline but former smokers by the time of follow-up. In addition, we constructed protein-metabolite networks with smoking-related genes and metabolites.
We identified 21 smoking-related metabolites in the baseline investigation (18 in men and six in women, with three overlaps) enriched in amino acid and lipid pathways, which were significantly different between current smokers and never smokers. Moreover, 19 out of the 21 metabolites were found to be reversible in former smokers. In the follow-up study, 13 reversible metabolites in men were measured, of which 10 were confirmed to be reversible in male quitters. Protein-metabolite networks are proposed to explain the consistent reversibility of smoking effects on metabolites.
We showed that smoking-related changes in human serum metabolites are reversible after smoking cessation, consistent with the known cardiovascular risk reduction. The metabolites identified may serve as potential biomarkers to evaluate the status of smoking cessation and characterize smoking-related diseases.
metabolic network; metabolomics; molecular epidemiology; smoking; smoking cessation
Bacillus licheniformis CGMCC3963 is an important mao-tai flavor-producing strain. It was isolated from the starter (Daqu) of a Chinese mao-tai-flavor liquor fermentation process with solid-state fermentation. We report its genome of 4,525,096 bp here. Many potential insertion genes that are responsible for the unique properties of B. licheniformis CGMCC3963 in mao-tai-flavor liquor production were identified.
The emergence of vertebrates is characterized by a strong increase in miRNA families. MicroRNAs interact broadly with many transcripts, and the evolution of such a system is intriguing. However, evolutionary questions concerning the origin of miRNA genes and their subsequent evolution remain unexplained.
To systematically understand the evolutionary relationship between miRNA genes and their function, we classified known human miRNAs into eight groups based on their evolutionary ages, estimated by the maximum parsimony method. New miRNA genes with new functional sequences accumulated more dynamically in vertebrates than in Drosophila. Different levels of evolutionary selection were observed over miRNA gene sequences with different times of origin. Most genic miRNAs differ from their host genes in time of origin: there is no particular relationship between the age of a miRNA and the age of its host gene, and genic miRNAs are mostly younger than the corresponding host genes. MicroRNAs that originated over different time-scales are often predicted or verified to target the same or overlapping sets of genes, opening the possibility of substantial functional redundancy among miRNAs of different ages. Young miRNAs showed a higher degree of tissue specificity and lower expression levels.
Our data showed that, compared with protein-coding genes, miRNA genes are more dynamic in terms of emergence and decay. Evolution patterns are quite different between miRNAs of different ages. MicroRNA activity is under tight control, with well-regulated expression increasing and targeting decreasing over time. Our work calls attention to the study of miRNA activity with consideration of the time of origin.
Sequencing of bacterial genomes has become an essential approach to studying pathogen virulence and the phylogenetic relationships among closely related strains. The bacterium Enterococcus faecium has emerged as an important nosocomial pathogen that is often associated with resistance to common antibiotics in hospitals. With its highly divergent gene content, it presents a challenge to next generation sequencing (NGS) technologies, which feature high throughput and shorter read lengths. This study was designed to investigate the properties and systematic biases of NGS technologies and to evaluate critical parameters influencing the outcomes of hybrid assemblies using combinations of NGS data.
A hospital strain of E. faecium was sequenced using three different NGS platforms, 454 GS-FLX, Illumina GAIIx, and ABI SOLiD4.0, to approximately 28-, 500-, and 400-fold coverage depth, respectively. We built a pipeline that merged contigs from each NGS dataset into hybrid assemblies. The results revealed that each single NGS assembly had a ceiling in continuity that could not be overcome by simply increasing coverage depth. Each NGS technology displayed some intrinsic properties, e.g. base calling errors and systematic biases. The gaps and low-coverage regions of each NGS assembly were associated with lower GC content. To optimize the hybrid assembly approach, we tested varying amounts and different combinations of NGS data, and obtained optimal conditions for assembly continuity. We also showed, for the first time, that SOLiD data could help produce much improved assemblies of the E. faecium genome using the hybrid approach when combined with other types of NGS data.
The current study addressed the difficult issue of how to most effectively construct a complete microbial genome using today's state-of-the-art sequencing technologies. We characterized the sequence data and genome assembly from each NGS technology, tested conditions for hybrid assembly with combinations of NGS data, and obtained optimized parameters for achieving the most cost-efficient assembly. Our study helps form guidelines to direct genomic work on other microorganisms, and thus has important practical implications.
Animal models are indispensable tools for studying the causes of human diseases and searching for treatments. The scientific value of an animal model depends on how accurately it mimics the human disease. The primary goal of the current study was to develop a cross-species method that uses animal models' expression data to evaluate their similarity to human diseases and to assess the effectiveness of drug molecules in drug research. We also aimed to show that it is feasible and useful to compare gene expression profiles across species in studies of pathology, toxicology, drug repositioning, and drug action mechanisms.
We developed a cross-species analysis method to analyze animal models' similarity to human diseases and effectiveness in drug research by utilizing the existing animal gene expression data in the public database, and mined some meaningful information to help drug research, such as potential drug candidates, possible drug repositioning, side effects and analysis in pharmacology. New animal models could be evaluated by our method before they are used in drug discovery.
We applied the method to several cases of known animal model expression profiles and obtained useful information to help drug research. We found that trichostatin A and some other HDAC inhibitors could produce very similar responses across cell lines and species at the gene expression level. The mouse hypoxia model could accurately mimic human hypoxia, while the mouse diabetes drug model might have some limitations. The transgenic mouse model of Alzheimer's disease was a useful model, and we analyzed in depth the biological mechanisms of some drugs in this case. In addition, all the cases could provide ideas for drug discovery and drug repositioning.
We developed a new cross-species gene expression module comparison method to use animal models' expression data to analyse the effectiveness of animal models in drug research. Moreover, through data integration, our method could be applied for drug research, such as potential drug candidates, possible drug repositioning, side effects and information about pharmacology.
Bacterial 16S ribosomal RNA (rRNA) profiling has been widely used in the classification of microbiota-associated diseases. Dimensionality reduction is one of the keys to mining high-dimensional 16S rRNA expression data. High levels of sparsity and redundancy are common in 16S rRNA gene microbial surveys. Traditional feature selection methods are generally restricted to measuring correlated abundances, and are limited in discrimination when so few microbes are actually shared across communities.
Here we present a Feature Merging and Selection algorithm (FMS) for 16S rRNA expression data. By integrating Linear Discriminant Analysis, FMS reduces the feature dimension with higher accuracy while preserving the relationships between different features. Two 16S rRNA expression datasets, from pneumonia and dental caries patients, were used to test the validity of the algorithm. Combined with SVM, FMS discriminated the different classes of both pneumonia and dental caries better than other popular feature selection methods.
FMS projects data into a lower dimension while preserving enough features, and thus improves the intelligibility of the result. The results showed that FMS is a more valid and reliable method for feature reduction.
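As a rough illustration of the discriminant idea behind Linear Discriminant Analysis, the sketch below ranks features by a one-dimensional Fisher criterion (between-class scatter divided by within-class scatter). It is not the FMS algorithm, whose merging step is not described here; the function names, feature names, and abundance values are all hypothetical.

```python
from statistics import mean


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in sorted(set(labels)):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        mu = mean(group)
        between += len(group) * (mu - overall) ** 2
        within += sum((v - mu) ** 2 for v in group)
    if within == 0.0:
        # No spread inside any class: perfectly separated or constant feature.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def rank_features(table, labels):
    """Sort feature names by decreasing Fisher score.

    `table` maps a feature name to its list of per-sample values
    (an assumed layout chosen for this illustration).
    """
    scores = {name: fisher_score(vals, labels) for name, vals in table.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A feature whose abundance differs strongly between classes but varies little within each class receives a high score; in a fuller pipeline, the top-ranked features (or merged features) would then be passed to a classifier such as an SVM.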
Hepatocellular carcinoma (HCC) is one of the deadliest cancers worldwide, and metastasis is a major cause of the high mortality in patients with HCC. However, the molecular mechanism underlying HCC metastasis is not fully understood. Studying regulatory networks may help investigate HCC metastasis from a systems biology perspective.
Using both sequence information and parallel microRNA (miRNA) and mRNA expression data from the same cohort of HBV-related HCC patients with or without venous metastasis, we constructed combinatorial regulatory networks of non-metastatic and metastatic HCC that contain both transcription factor (TF) regulation and miRNA regulation. Differential regulation patterns, classifying marker modules, and key regulatory miRNAs were analyzed by comparing the non-metastatic and metastatic networks.
Globally, TFs accounted for the major part of regulation and miRNAs for the minor part. However, miRNAs played a more active role in the metastatic network than in the non-metastatic one. Seventeen differential regulatory modules discriminative of metastatic status were identified as a cumulative-module classifier, which could also distinguish survival time. MiR-16, miR-30a, Let-7e and miR-204 were identified as key miRNA regulators contributing to HCC metastasis.
In this work, we demonstrated an integrative approach to differential combinatorial regulatory network analysis in the specific context of venous metastasis of HBV-HCC. Our results suggest possible transcriptional regulatory patterns underlying the different metastatic subgroups of HCC. The workflow in this study can be applied in similar contexts of cancer research and could also be extended to other clinical topics.
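The comparison of two regulatory networks can be sketched in a few lines. The example below assumes each network is a set of (regulator, target, regulator type) edges, which is a representation chosen for illustration and not necessarily the one used in the study; the regulator and gene names are placeholders, not results.

```python
def mirna_fraction(edges):
    """Fraction of regulatory edges whose regulator is a miRNA."""
    if not edges:
        return 0.0
    return sum(1 for _, _, kind in edges if kind == "miRNA") / len(edges)


def compare_networks(non_metastatic, metastatic):
    """Split edges into those lost, gained, and shared between two networks."""
    return {
        "lost": non_metastatic - metastatic,
        "gained": metastatic - non_metastatic,
        "shared": non_metastatic & metastatic,
    }


# Hypothetical toy networks: (regulator, target, regulator type).
non_met = {
    ("TF_A", "gene1", "TF"),
    ("TF_A", "gene2", "TF"),
    ("TF_B", "gene3", "TF"),
    ("miR_X", "gene1", "miRNA"),
}
met = {
    ("TF_A", "gene1", "TF"),
    ("TF_B", "gene3", "TF"),
    ("miR_X", "gene1", "miRNA"),
    ("miR_Y", "gene2", "miRNA"),
}

diff = compare_networks(non_met, met)
```

In this toy case the miRNA share of edges is higher in the second network, and `diff["gained"]` holds the edge that appears only there. A real analysis would additionally group differential edges into modules and test their association with clinical outcome.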
Detecting the borders between coding and non-coding regions is an essential step in genome annotation. Information entropy measures are useful for describing the signals in genome sequences. However, the accuracy of previous border-finding methods based on entropic segmentation still needs to be improved.
In this study, we first applied a new recursive entropic segmentation method to DNA sequences to obtain preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along the three phases in both DNA strands. This process requires no prior training datasets.
Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences.
This paper presents a new segmentation method for prokaryotes based on the Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, compared with the A12_JR method, our method improved the accuracy of finding the borders between protein-coding and non-coding regions in DNA sequences.
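The general idea of recursive entropic segmentation can be sketched as follows. This is a simplified illustration under several assumptions: it uses the Jensen-Shannon divergence (the limit of the Jensen-Rényi divergence as the Rényi order approaches 1), it works directly on the raw symbols of the input instead of the 22-symbol alphabet, and a fixed `threshold` stands in for a proper statistical significance test. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def jensen_shannon(left, right):
    """Length-weighted Jensen-Shannon divergence between two segments."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_cut(seq, min_len=1):
    """Return (position, divergence) of the cut that maximizes the divergence."""
    best_pos, best_div = None, -1.0
    for pos in range(min_len, len(seq) - min_len + 1):
        div = jensen_shannon(seq[:pos], seq[pos:])
        if div > best_div:
            best_pos, best_div = pos, div
    return best_pos, best_div


def segment(seq, threshold, min_len=1, offset=0):
    """Recursively cut seq while the best divergence reaches the threshold.

    Returns the sorted cut positions relative to the full sequence.
    """
    if len(seq) < 2 * min_len:
        return []
    pos, div = best_cut(seq, min_len)
    if pos is None or div < threshold:
        return []
    return (segment(seq[:pos], threshold, min_len, offset)
            + [offset + pos]
            + segment(seq[pos:], threshold, min_len, offset + pos))
```

For a sequence made of two compositionally distinct halves, the best cut falls at their junction, and the recursion stops once the remaining segments are homogeneous. No training data is involved: the cuts depend only on the composition of the sequence itself.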
The C4 photosynthetic cycle supercharges photosynthesis by concentrating CO2 around ribulose-1,5-bisphosphate carboxylase and significantly reduces the oxygenation reaction. Therefore, engineering C4 features into C3 plants such as rice, wheat, and potato has been suggested as a feasible way to increase their photosynthesis and yield. To identify a possible transition from C3 to C4 plants, a systematic comparison of C3 and C4 metabolism is necessary.
We compared C3 and C4 metabolic networks using improved constraint-based models for Arabidopsis and maize. Using graph theory, we found that the C3 network exhibits a denser topology than the C4 network. The simulation of enzyme knockouts demonstrated that both C3 and C4 networks are very robust, especially when optimizing CO2 fixation. Moreover, the C4 plant is more robust regardless of whether the objective function is biomass synthesis or CO2 fixation. In addition, all the essential reactions in the C3 network are also essential for C4, while some other reactions are specifically essential for C4, which indicates that the basic metabolism of the C4 plant is similar to that of C3 but more complex. We also identified more correlated reaction sets in C4, and demonstrated that C4 plants have better modularity than C3 plants, with complex mechanisms coordinating the reactions and pathways. We also found that biomass production and CO2 fixation increase faster with light intensity and CO2 concentration in C4 than in C3, which reflects more efficient use of light and CO2 in the C4 plant. Finally, we explored the contribution of different C4 subtypes to biomass production by setting specific constraints.
All results are consistent with the actual situation, which indicates that Flux Balance Analysis is a powerful method for studying plant metabolism at the systems level. We demonstrated that, in contrast to C3 plants, C4 plants have a less dense topology, higher robustness, better modularity, and higher CO2 and radiation use efficiency. In addition, preliminary analysis indicated that the rates of CO2 fixation and biomass production in the PCK subtype are superior to those in the NADP-ME and NAD-ME subtypes under a sufficient supply of water and nitrogen.
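The density comparison between the two networks rests on a standard graph measure: for an undirected simple graph with N nodes and E edges, density is 2E / (N(N - 1)). A minimal sketch follows, assuming the metabolic network has already been reduced to a list of undirected edges; that preprocessing step and the function name are assumptions for illustration, not the procedure used in the study.

```python
def graph_density(edges):
    """Density of an undirected simple graph given as (u, v) pairs.

    Duplicate edges and self-loops are ignored, and only nodes that
    appear in at least one retained edge are counted.
    """
    nodes = set()
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        nodes.update((u, v))
        unique_edges.add(frozenset((u, v)))
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))
```

A fully connected graph has density 1, while a sparse chain of the same size scores much lower, so a lower value for one network than another indicates fewer connections per possible node pair.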
The new release of SchistoDB (http://SchistoDB.net) provides a rich resource of genomic data for key blood flukes (genus Schistosoma), which cause disease in hundreds of millions of people worldwide. SchistoDB integrates whole-genome sequence and annotation of three species of the genus and provides enhanced bioinformatics analyses and data-mining tools. A simple yet comprehensive web interface, provided through the Strategies Web Development Kit, is available for mining and visualizing the data. Genome-scale data can be queried using BLAST searches, annotation keywords and gene ID searches, gene ontology terms, sequence motifs, protein characteristics and phylogenetic relationships. Search strategies can be saved within a user’s profile for future retrieval and may also be shared with other researchers using a unique web address.