Publicly available multi-omic databases, in particular if associated with medical annotations, are rich resources with the potential to lead a rapid transition from high-throughput molecular biology experiments to better clinical outcomes for patients. In this work, we propose a model for multi-omic data integration (i.e., genetic variations, gene expression, genome conformation, and epigenetic patterns), which exploits a multi-layer network approach to analyse, visualize, and obtain insights from such biological information, so that the results can be used at a macroscopic level. Using this representation, we can describe how driver and passenger mutations accumulate during the development of diseases, providing, for example, a tool able to characterize the evolution of cancer. Indeed, our test case concerns the MCF-7 breast cancer cell line, before and after stimulation with estrogen, since many datasets are available for this case study. In particular, the integration of data about cancer mutations, gene functional annotations, genome conformation, epigenetic patterns, gene expression, and metabolic pathways in our multi-layer representation will allow a better interpretation of the mechanisms behind a complex disease such as cancer. Thanks to this multi-layer approach, we focus on the interplay of chromatin conformation and cancer mutations in different pathways, such as metabolic processes, that are very important for tumor development. Working on this model, a variance analysis can be implemented to identify normal variations within each omics and to characterize, by contrast, variations that can be attributed to pathological samples compared to normal ones. This integrative model can be used to identify novel biomarkers and to provide innovative omic-based guidelines for treating many diseases, improving the efficacy of decision trees currently used in the clinic.
gene functional annotations; chromosome conformation capture; metabolic pathways; epigenetic patterns; gene expression; cancer mutations; multi-layer networks; damage spreading
During library construction polymerase chain reaction is used to enrich the DNA before sequencing. Typically, this process generates duplicate read sequences. Removal of these artifacts is mandatory, as they can affect the correct interpretation of data in several analyses. Ideally, duplicate reads should be characterized by identical nucleotide sequences. However, due to sequencing errors, duplicates may also be nearly-identical. Removing nearly-identical duplicates can result in a notable computational effort. To deal with this challenge, we recently proposed a GPU method aimed at removing identical and nearly-identical duplicates generated with an Illumina platform.
The method implements an approach based on prefix-suffix comparison. Read sequences with identical prefix are considered potential duplicates. Then, their suffixes are compared to identify and remove those that are actually duplicated.
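The prefix-suffix comparison described above can be sketched sequentially on the CPU. This is only an illustration of the idea, not the GPU implementation: the function names, the prefix length of 8, and the one-mismatch tolerance are assumptions chosen for the example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def remove_duplicates(reads, prefix_len=8, max_mismatches=1):
    """Keep one representative per group of identical or nearly identical reads.

    prefix_len and max_mismatches are illustrative values (assumptions),
    not parameters taken from the tool described in the text.
    """
    # Step 1: reads sharing the same prefix are potential duplicates.
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:prefix_len]].append(read)

    kept = []
    for cluster in clusters.values():
        representatives = []
        for read in cluster:
            suffix = read[prefix_len:]
            # Step 2: compare the suffix with those of the reads already kept.
            is_duplicate = any(
                len(rep) == len(read)
                and hamming(rep[prefix_len:], suffix) <= max_mismatches
                for rep in representatives
            )
            if not is_duplicate:
                representatives.append(read)
        kept.extend(representatives)
    return kept


reads = [
    "ACGTACGTAAAA",
    "ACGTACGTAAAA",  # identical duplicate
    "ACGTACGTAAAT",  # nearly identical: one mismatch in the suffix
    "ACGTACGTCCCC",  # same prefix, different suffix
    "TTTTACGTAAAA",  # different prefix
]
unique = remove_duplicates(reads)
print(unique)
```

In this sketch a sequencing error inside the prefix would place a read in a different cluster, so only mismatches in the suffix are tolerated.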
Although the method can be efficiently used to remove duplicates, there are some limitations that need to be overcome. In particular, it cannot detect potential duplicates when prefixes are longer than 27 bases, and it does not provide support for paired-end read libraries. Moreover, large clusters of potential duplicates are split into smaller ones in order to guarantee a reasonable computing time. This heuristic may affect the accuracy of the analysis.
In this work we propose GPU-DupRemoval, a new implementation of our method able to (i) cluster reads without constraints on the maximum length of the prefixes, (ii) support both single- and paired-end read libraries, and (iii) analyze large clusters of potential duplicates.
Due to the massive parallelization obtained by exploiting graphics cards, GPU-DupRemoval removes duplicate reads faster than other cutting-edge solutions, while outperforming most of them in terms of the amount of duplicate reads removed.
Next generation sequencing; Duplicate reads; Graphics processing units; CUDA
This preface introduces the content of the BioMed Central journal Supplements related to the BITS 2015 meeting, held in Milan, Italy, from the 3rd to the 5th of June, 2015.
BITS; Bioinformatics; Italian Society of Bioinformatics meeting
Alzheimer's disease has recently emerged as a possible field of application for PDE4D inhibitors (PDE4DIs). The great structural similarity among the various PDE4 isoforms and, furthermore, the lack of a full-length crystal structure of the enzyme have impaired the rational design of new selective PDE4DIs. In this paper, with the aim of gaining new insights into PDE4D binding, we tackled the problem by performing a computational study based on docking simulations combined with molecular dynamics (D-MD). Our work uniquely identified the binding mode and the key residues involved in the interaction with a number of in-house catechol iminoether derivatives acting as PDE4DIs. Moreover, the new binding mode was tested using a series of analogues previously reported by us, and it was used to confirm the key structural features that allow PDE4D inhibition. The binding model disclosed within the current computational study may prove useful to further advance the design and synthesis of novel, more potent and selective PDE4D inhibitors.
Phosphodiesterases; PDE4; PDE4D selective inhibitors; docking; molecular dynamics
A relation exists between network proximity of molecular entities in interaction networks, functional similarity and association with diseases. The identification of network regions associated with biological functions and pathologies is a major goal in systems biology. We describe a network diffusion-based pipeline for the interpretation of different types of omics in the context of molecular interaction networks. We introduce the network smoothing index, a network-based quantity that jointly quantifies the amount of omics information in genes and in their network neighbourhood, using network diffusion to define network proximity. The approach is applicable to both descriptive and inferential statistics calculated on omics data. We also show that network resampling, applied to gene lists ranked by quantities derived from the network smoothing index, indicates the presence of significantly connected genes. As a proof of principle, we identified gene modules enriched in somatic mutations and transcriptional variations observed in samples of prostate adenocarcinoma (PRAD). In line with the local hypothesis, network smoothing index and network resampling underlined the existence of a connected component of genes harbouring molecular alterations in PRAD.
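To make the notion of network diffusion concrete, the following is a minimal sketch of one common diffusion scheme, a random walk with restart, on a toy graph. It is not the authors' definition of the network smoothing index, which the abstract does not spell out; the function name `diffuse`, the restart weight `alpha`, and the toy network are all assumptions made for illustration.

```python
def diffuse(adjacency, scores, alpha=0.7, n_iter=100):
    """Spread per-gene scores over an undirected network.

    Iterates F <- alpha * W F + (1 - alpha) * F0, where W passes each node's
    score to its neighbours in equal shares. alpha and n_iter are illustrative.
    """
    degree = {node: len(neigh) for node, neigh in adjacency.items()}
    current = dict(scores)
    for _ in range(n_iter):
        updated = {}
        for node, neighbours in adjacency.items():
            # Each neighbour m contributes an equal share of its current score.
            incoming = sum(current[m] / degree[m] for m in neighbours)
            updated[node] = alpha * incoming + (1 - alpha) * scores[node]
        current = updated
    return current


# Toy path graph a - b - c - d; only gene "a" carries an omics signal.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
smoothed = diffuse(adjacency, scores)
for gene, value in smoothed.items():
    print(gene, round(value, 3))
```

After diffusion, genes closer to the altered gene receive higher scores, which is the sense in which diffusion defines a network proximity.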
Because they are involved in synaptic transmission and located on the X chromosome, neuroligins 3 and 4X have been studied as good positional and functional candidate genes for autism spectrum disorder pathogenesis, although contradictory results have been reported. Here, we performed a case-control study to assess the association between noncoding genetic variants in the NLGN3 and NLGN4X genes and autism, in an Italian cohort of 202 autistic children analyzed by high-resolution melting. The results were first compared with data from 379 European healthy controls (1000 Genomes Project) and then with those from 1061 Italian controls genotyped by Illumina single nucleotide polymorphism (SNP) array 1M-duo. Statistical evaluations were performed using Plink v1.07, with the Omnibus multiple loci approach. According to both the European and the Italian control groups, a 6-marker haplotype on NLGN4X (rs6638575(G), rs3810688(T), rs3810687(G), rs3810686(C), rs5916269(G), rs1882260(T)) was associated with autism (odds ratio = 3.58, p-value = 2.58 × 10⁻⁶ for the European controls; odds ratio = 2.42, p-value = 6.33 × 10⁻³ for the Italian controls). Furthermore, several 5-, 4-, 3-, and 2-marker haplotype blocks, including the first 5, 4, 3, and 2 SNPs, respectively, showed a similar association with autism. We provide evidence that noncoding polymorphisms on NLGN4X may be associated with autism, suggesting a key role of NLGN4X in autism pathophysiology and in its male prevalence.
autism; genetics; neuroligins; SNPs; haplotype analysis; noncoding regions
Myofibrillar myopathies (MFMs) are genetically heterogeneous dystrophies characterized by the disintegration of Z-disks and myofibrils and are associated with mutations in genes encoding Z-disk or Z-disk-related proteins. The c.626 C > T (p.P209L) mutation in the BAG3 gene has been described as causative of a subtype of MFM. We report a sporadic case of a 26-year-old Italian woman, affected by MFM with axonal neuropathy, cardiomyopathy, rigid spine, who carries the c.626 C > T mutation in the BAG3 gene. The patient and her non-consanguineous healthy parents and brother were studied with whole exome sequencing (WES) to further investigate the genetic basis of this complex phenotype. In the patient, we found that the BAG3 mutation is associated with variants in the NRAP and FHL1 genes that encode muscle-specific, LIM domain containing proteins. Quantitative real time PCR, immunohistochemistry and Western blot analysis of the patient’s muscular biopsy showed the absence of NRAP expression and FHL1 accumulation in aggregates in the affected skeletal muscle tissue. Molecular dynamic analysis of the mutated FHL1 domain showed a modification in its surface charge, which could affect its capability to bind its target proteins. To our knowledge this is the first study reporting, in a BAG3 MFM, the simultaneous presence of genetic variants in the BAG3 and FHL1 genes (previously described as independently associated with MFMs) and linking the NRAP gene to MFM for the first time.
Myofibrillar myopathies; Exome sequencing; LIM proteins; BAG3
Bronchial smooth muscle (BSM) cells from asthmatic patients maintain in vitro a distinct hyper-reactive (“primed”) phenotype, characterized by increased release of pro-inflammatory factors and mediators, as well as hyperplasia and/or hypertrophy. This “primed” phenotype helps to understand pathogenesis of asthma, as changes in BSM function are essential for manifestation of allergic and inflammatory responses and airway wall remodelling.
To identify signalling pathways in cultured primary BSMs of asthma patients and non-asthmatic subjects by genome wide profiling of differentially expressed mRNAs and activated intracellular signalling pathways (ISPs).
Transcriptome profiling by cap-analysis-of-gene-expression (CAGE), which permits selection of preferentially capped mRNAs most likely to be translated into proteins, was performed in human BSM cells from asthmatic (n=8) and non-asthmatic (n=6) subjects, and the OncoFinder tool was then exploited for identification of ISP deregulations.
CAGE revealed >600 RNAs differentially expressed in asthma vs control cells (p≤0.005), with asthma samples showing a high degree of similarity among them. Comprehensive ISP activation analysis revealed that among 269 pathways analysed, 145 (p<0.05) or 103 (p<0.01) are differentially active in asthma, with profiles that clearly characterize BSM cells of asthmatic individuals. Notably, we identified 7 clusters of coherently acting pathways functionally related to the disease, with ISPs down-regulated in asthma mostly targeting cell death-promoting pathways and up-regulated ones affecting cell growth and proliferation, inflammatory response, control of smooth muscle contraction and hypoxia-related signalization.
These first-time results can now be exploited toward development of novel therapeutic strategies targeting ISP signatures linked to asthma pathophysiology.
asthma; smooth muscle cells; signalling pathways; CAGE
The Hsp70 is an allosterically regulated family of molecular chaperones. They consist of two structural domains, NBD and SBD, connected by a flexible linker. ATP hydrolysis at the NBD modulates substrate recognition at the SBD, while peptide binding at the SBD enhances ATP hydrolysis. In this study we apply Molecular Dynamics (MD) to elucidate the molecular determinants underlying the allosteric communication from the NBD to the SBD and back. We observe that local structural and dynamical modulation can be coupled to large-scale rearrangements, and that different combinations of ligands at NBD and SBD differently affect the SBD domain mobility. Substituting ADP with ATP in the NBD induces specific structural changes involving the linker and the two NBD lobes. Also, a SBD-bound peptide drives the linker docking by increasing the local dynamical coordination of its C-terminal end: a partially docked DnaK structure is achieved by combining ATP in the NBD and peptide in the SBD. We propose that the MD-based analysis of the inter domain dynamics and structure modulation could be used as a tool to computationally predict the allosteric behaviour and functional response of Hsp70 upon introducing mutations or binding small molecules, with potential applications for drug discovery.
Phosphorylation is one of the most important post-translational modifications (PTMs) employed by cells to regulate several cellular processes. Studying the effects of phosphorylation on protein structures makes it possible to investigate the modulation mechanisms of several proteins, including chaperones such as the small HSPs, which display different multimeric structures according to the phosphorylation of a few serine residues. In this context, the proposed study is aimed at finding a method to correlate different PTM patterns (in particular phosphorylations at the monomer interface of multimeric complexes) with the dynamic behaviour of the complex, using physicochemical parameters derived from molecular dynamics simulations on the nanosecond timescale.
We have developed a methodology relying on computing nine physicochemical parameters, derived from the analysis of short MD simulations, and combined with N identifiers that characterize the PTMs of the analysed protein. The nine general parameters were validated on three proteins with known post-translationally modified and unmodified conformations. Then, we applied this approach to the case study of αB-Crystallin, a chaperone whose multimeric state (up to 40 units) is supposed to be controlled by phosphorylation of Ser45 and Ser59. Phosphorylation of serines at the dimer interface induces the release of hexamers, the active state of αB-Crystallin. 30 ns of MD simulation were obtained for each possible combination of dimer phosphorylation state, and average values of structural, dynamic, energetic and functional features were calculated on the equilibrated portion of the trajectories. Principal Component Analysis was applied to the parameters and the first five Principal Components, which summed up to 84 % of the total variance, were finally considered.
The validation of this approach on multimeric proteins, whose structures were known in both modified and unmodified forms, allowed us to propose a new approach that can be used to predict the impact of PTM patterns in multi-modified proteins using data collected from short molecular dynamics simulations. Analysis of the αB-Crystallin case study clusters all-P dimers together with all-P hexamers and the no-P dimer with the no-P hexamer, and the results suggest a great influence of Ser59 phosphorylation on chain B.
Post-translational modification; Phosphorylation; Molecular dynamics; PCA; Clustering; αB-Crystallin; Chaperone; Small HSP
The culture of progenitor mesenchymal stem cells (MSC) onto osteoconductive materials to induce a proper osteogenic differentiation and mineralized matrix regeneration represents a promising and widely diffused experimental approach for tissue-engineering (TE) applications in orthopaedics. Among modern biomaterials, calcium phosphates represent the best bone substitutes, due to their chemical features emulating the mineral phase of bone tissue. Although many studies on stem cell differentiation mechanisms have been performed involving calcium-based scaffolds, results often focus on highlighting production of in vitro bone matrix markers and in vivo tissue ingrowth, while information related to the biomolecular mechanisms involved in the early cellular calcium-mediated differentiation is not well elucidated yet. Genetic programs for osteogenesis have been only partially deciphered, and the description of the different molecules and pathways operative in these differentiations is far from complete, as is that of the activity of calcium in this process. The present work aims to shed light on the involvement of extracellular calcium in MSC differentiation: a better understanding of the early stage osteogenic differentiation program of MSC seeded on calcium-based biomaterials is required in order to develop optimal strategies to promote osteogenesis through the use of new generation osteoconductive scaffolds. A wide spectrum of analyses has been performed on time-dependent series: gene expression profiles are obtained from samples (MSC seeded on calcium-based scaffolds), together with related microRNA expression and in vivo functional validation. On this basis, and relying on literature knowledge, hypotheses are made on the biomolecular players activated by the biomaterial calcium-phosphate component. Interestingly, a key role of miR-138 was highlighted, whose inhibition markedly increases osteogenic differentiation in vitro and enhances ectopic bone formation in vivo.
Moreover, there is evidence that Ca-P substrate triggers osteogenic differentiation through genes (SMAD and RAS family) that are typically regulated during dexamethasone (DEX) induced differentiation.
Interest in understanding the mechanisms that lead to a particular composition of the Gut Microbiota is growing rapidly, due to the relationship between this ecosystem and the host health state. Particularly relevant is the study of the Relative Species Abundance (RSA) distribution, which is a component of biodiversity and measures the number of species having a given number of individuals. It is the universal behaviour of RSA that induced many ecologists to look for theoretical explanations. In particular, a simple stochastic neutral model relying on population dynamics was proposed by Volkov et al. and was shown to fit the RSA of coral reefs and rain forests. Our aim is to ascertain whether this model also describes the Microbiota RSA and whether it can help in explaining the Microbiota plasticity.
We analyzed 16S rRNA sequencing data sampled from the Microbiota of three different animal species by Jeraldo et al. Through a clustering procedure (UCLUST), we built the Operational Taxonomic Units. These correspond to bacterial species considered at a given phylogenetic level defined by the similarity threshold used in the clustering procedure. The RSAs, plotted in the form of Preston plots, were fitted with Volkov’s model. The model fits the Microbiota RSA well, except in the tail region, which shows a deviation from the neutrality assumption. Looking at the model parameters, we were able to discriminate between different animal species and to give a biological explanation. Moreover, the biodiversity estimator obtained by Volkov’s model also differentiates the animal species and is in good agreement with the first and second order Hill’s numbers, which are common evenness indexes simply based on the fraction of individuals per species.
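The two descriptive quantities mentioned above, the Preston plot and the Hill numbers of order 1 and 2, can be computed from a vector of OTU abundances as sketched below. The abundances are hypothetical, the binning is a simplified octave scheme (classic Preston binning splits boundary values between adjacent bins), and fitting Volkov's model is not attempted here.

```python
import math
from collections import Counter


def preston_bins(abundances):
    """Count species per abundance octave.

    Bin k holds the species whose abundance n satisfies 2**k <= n < 2**(k + 1).
    This is a simplified binning, assumed here for illustration.
    """
    counts = Counter(n.bit_length() - 1 for n in abundances if n > 0)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


def hill_number(abundances, q):
    """Hill number of order q, based on the fraction of individuals per species."""
    total = sum(abundances)
    fractions = [n / total for n in abundances if n > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in fractions))
    return sum(p ** q for p in fractions) ** (1.0 / (1 - q))


otu_counts = [1, 1, 2, 3, 4, 7, 8, 16]  # hypothetical OTU abundances
bins = preston_bins(otu_counts)
print("Preston bins:", bins)
print("Hill q=1:", round(hill_number(otu_counts, 1), 3))
print("Hill q=2:", round(hill_number(otu_counts, 2), 3))
```

For a perfectly even community, both Hill numbers equal the number of species; the more uneven the abundances, the further they fall below it.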
We conclude that the neutrality assumption is a good approximation for the Microbiota dynamics and the observation that Volkov’s model works for this ecosystem is a further proof of the RSA universality. Moreover, the ability to separate different animals with the model parameters and biodiversity number are promising results if we think about future applications on human data, in which the Microbiota composition and biodiversity are in close relationships with a variety of diseases and life-styles.
Microbiota; 16S RNA; OTU; RSA; Biodiversity; Ecological modelling
Methods for the integrative analysis of multi-omics data are required to draw a more complete and accurate picture of the dynamics of molecular systems. The complexity of biological systems, the technological limits, the large number of biological variables and the relatively low number of biological samples make the analysis of multi-omics datasets a non-trivial problem.
Results and Conclusions
We review the most advanced strategies for integrating multi-omics datasets, focusing on mathematical and methodological aspects.
Omics; Multi-omics; Data integration
People who reach extreme ages (Long-Living Individuals, LLIs) are the object of intense investigation for increases or decreases in genetic variant frequencies, genetic methylation levels, and protein abundance in serum and tissues. The aim of these studies is the discovery of the mechanisms behind the extreme longevity of LLIs and the identification of markers of well-being. We have recently associated a BPIFB4 haplotype (LAV) with exceptional longevity under a homozygous genetic model, and found that CD34+ cells of LLIs express higher levels of BPIFB4 transcript than CD34+ cells of the control population. It would be of interest to correlate serum BPIFB4 protein levels with exceptional longevity and health status of LLIs.
Western blots on cellular medium to detect BPIFB4 secretion in transfected HEK293T cells with plasmid carrying BPIFB4 and ELISA on LLIs serum to detect BPIFB4 levels.
Here we show that BPIFB4 is a secreted protein and its levels are increased in serum of LLIs, and high BPIFB4 levels classify their health status.
Serum BPIFB4 protein levels classify longevity and health status in LLIs. Further studies are required to evaluate the possible role of BPIFB4 in monitoring disease progression.
BPIFB4; Methylation; CD34; Vascular ageing
The human genome is a mosaic of isochores, which are long (>200 kb) DNA sequences that are fairly homogeneous in base composition and can be assigned to five families comprising 33%–59% of GC composition. Although the compartmentalized organization of the mammalian genome has been investigated for more than 40 years, no satisfactory automatic procedure for segmenting the genome into isochores is available so far. We present a critical discussion of the currently available methods and a new approach called isoSegmenter which allows segmenting the genome into isochores in a fast and completely automatic manner. This approach relies on two types of experimentally defined parameters, the compositional boundaries of isochore families and an optimal window size of 100 kb. The approach represents an improvement over the existing methods, is ideally suited for investigating long-range features of sequenced and assembled genomes, and is publicly available at https://github.com/bunop/isoSegmenter.
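The two parameters named above, a fixed window size and the compositional boundaries of the isochore families, suggest a simple window-based classification, sketched below. This is not the isoSegmenter code: the GC thresholds (37%, 41%, 46%, 53%) are the commonly cited family boundaries and are an assumption here, and merging adjacent windows of the same family is a simplification.

```python
# Upper GC bounds of the first four families (assumed values); above 53% is H3.
FAMILY_UPPER_BOUNDS = [(0.37, "L1"), (0.41, "L2"), (0.46, "H1"), (0.53, "H2")]


def gc_fraction(seq):
    """GC fraction over unambiguous bases, or None if the window is all gaps."""
    seq = seq.upper()
    known = sum(seq.count(base) for base in "ACGT")
    if known == 0:
        return None
    return (seq.count("G") + seq.count("C")) / known


def classify(gc):
    """Assign a GC fraction to one of the five isochore families."""
    for upper, family in FAMILY_UPPER_BOUNDS:
        if gc < upper:
            return family
    return "H3"


def segment(seq, window=100_000):
    """Classify fixed windows, then merge adjacent windows of the same family."""
    segments = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        family = "gap" if gc is None else classify(gc)
        end = start + len(chunk)
        if segments and segments[-1][2] == family:
            segments[-1] = (segments[-1][0], end, family)
        else:
            segments.append((start, end, family))
    return segments


# Tiny toy sequence with a 10-base window, for illustration only.
toy = "AT" * 10 + "GC" * 5
result = segment(toy, window=10)
print(result)
```

On a real chromosome the same call would use the default 100 kb window, yielding a list of `(start, end, family)` tuples.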
bioinformatics; comparative genomics; evolution
In this paper comparative genome and phenotype microarray analyses of Rhodococcus sp. BCP1 and Rhodococcus opacus R7 were performed. Rhodococcus sp. BCP1 was selected for its ability to grow on short-chain n-alkanes and R. opacus R7 was isolated for its ability to grow on naphthalene and on o-xylene. Results of genome comparison, including BCP1 and R7 along with other Rhodococcus reference strains, showed that at least 30% of the genome of each strain presented unique sequences and only 50% of the predicted proteome was shared. To associate genomic features with metabolic capabilities of the BCP1 and R7 strains, hundreds of different growth conditions were tested through Phenotype Microarray, by using Biolog plates and plates manually prepared with additional xenobiotic compounds. Around one-third of the surveyed carbon sources were utilized by both strains, although R7 generally showed higher metabolic activity values compared to BCP1. Moreover, R7 utilized a broader range of nitrogen and sulphur sources. Phenotype Microarray data were combined with genomic analysis to genetically support the metabolic features of the two strains. The genome analysis made it possible to identify some gene clusters involved in the metabolism of the main tested xenobiotic compounds. Results show that R7 contains multiple genes for the degradation of a large set of aromatic compounds and PAHs, while a lower variability in terms of genes predicted to be involved in aromatic degradation was found in BCP1. This genetic feature can be related to the strong genetic pressure exerted by the two different environments from which the two strains were isolated. Accordingly, in the BCP1 genome the smo gene cluster, involved in the degradation of short-chain n-alkanes, is included in one of the unique regions and is not conserved in the Rhodococcus strains compared in this work. The data obtained underline the great potential of these two Rhodococcus spp. strains for biodegradation and environmental decontamination processes.
Systems Medicine (SM) can be defined as an extension of Systems Biology (SB) to Clinical-Epidemiological disciplines through a paradigm shift from a cell-centered toward a patient-centered framework. According to this vision, the three pillars of SM are Biomedical hypotheses; experimental data, mainly obtained by Omics technologies; and tailored computational, statistical and modeling tools. The three SM pillars are highly interconnected, and balancing them is crucial. Despite the great technological progress producing huge amounts of data (Big Data) and impressive computational facilities, the Biomedical hypotheses are still of primary importance. A paradigmatic example of a unifying Biomedical theory is the concept of Inflammaging. This complex phenotype is involved in a large number of pathologies and patho-physiological processes such as aging, age-related diseases and cancer, all sharing a common inflammatory pathogenesis. This Biomedical hypothesis can be mapped into an ecological perspective capable of describing, by quantitative and predictive models, some experimentally observed features, such as microenvironment, niche partitioning and phenotype propagation. In this article we show how this idea can be supported by computational methods useful to successfully integrate, analyze and model large data sets, combining cross-sectional and longitudinal information on clinical, environmental and omics data of healthy subjects and patients to provide new multidimensional biomarkers capable of distinguishing between different conditions, e.g. healthy versus unhealthy state, physiological versus pathological aging.
propagation; ecological model; networks; multilayer networks; inflammation; multi-scale
Autism is an increasing neurodevelopmental disease that appears by 3 years of age, has genetic and/or environmental etiology, and often shows comorbid situations, such as gastrointestinal (GI) disorders. Autism has also a striking sex-bias, not fully genetically explainable.
Our goal was to explain how and in which predisposing conditions some compounds can impair neurodevelopment, why this occurs in the first years of age, and, primarily, why more in males than females.
We reviewed articles regarding the genetic and environmental etiology of autism and toxins effects on animal models selected from PubMed and databases about autism and toxicology.
Our hypothesis proposes that in the first year of life, the decrease in maternal immune protection and the immaturity of the child's immune system create an immune vulnerability to infectious diseases that, especially if treated with antibiotics, could facilitate dysbiosis and GI disorders. This condition triggers a vicious circle between immune system impairment and increasing dysbiosis that leads to leaky gut and to the production and absorption of neurochemical compounds and/or neurotoxic xenobiotics. This alteration affects the ‘gut-brain axis’ communication that connects the gut with the central nervous system via the immune system. Thus, metabolic pathways impaired in autistic children can be affected by genetic alterations or by environment–xenobiotics interference. In addition, in animal models many xenobiotics exert their neurotoxicity in a sex-dependent manner.
We integrate fragmented and multi-disciplinary information in a unique hypothesis and first disclose a possible environmental origin for the imbalance of male:female distribution of autism, reinforcing the idea that exogenous factors are related to the recent rise of this disease.
Environmental autism; Gut dysbiosis; Immune system; Sex bias; Xenobiotics
DnaK, the bacterial homolog of human Hsp70, plays an important role in pathogen survival under stress conditions, such as antibiotic therapies. This chaperone sequesters protein aggregates accumulated in bacteria during antibiotic treatment, reducing the effect of the cure. Although different classes of DnaK inhibitors have already been designed, they present low specificity. DnaK is highly conserved in prokaryotes (identity 50–70%), which encourages the development of a unique inhibitor for many different bacterial strains. We used the DnaK of Acinetobacter baumannii as representative for our analysis, since it is one of the most important opportunistic human pathogens, exhibits significant drug resistance and has the ability to survive in hospital environments. The E. coli DnaK was also included in the analysis as a reference structure due to its wide diffusion. Unfortunately, bacterial DnaK and human Hsp70 have a high sequence similarity. Therefore, we performed a differential analysis of DnaK and Hsp70 residues to identify hot spots in bacterial proteins that are not present in the human homolog, with the aim of characterizing the key pharmacological features necessary to design selective inhibitors for DnaK. Different conformations of DnaK and Hsp70 bound to known inhibitor peptides that are effective for DnaK and ineffective for Hsp70 have been analysed by molecular dynamics simulations to identify residues displaying stable and selective interactions with these peptides. Results achieved in this work show that there are some residues that can be used to build selective inhibitors for DnaK, which should be ineffective for the human Hsp70.
The representation, integration, and interpretation of omic data is a complex task, in particular considering the huge amount of information that is produced daily in molecular biology laboratories all around the world. The reason is that sequencing data regarding expression profiles, methylation patterns, and chromatin domains are difficult to harmonize in a systems biology view, since genome browsers only allow coordinate-based representations, discarding functional clusters created by the spatial conformation of the DNA in the nucleus. In this context, recent progress in high-throughput molecular biology techniques and bioinformatics has provided insights into chromatin interactions on a larger scale and offers formidable support for the interpretation of multi-omic data. In particular, a novel sequencing technique called Chromosome Conformation Capture allows the analysis of the chromosome organization in the cell’s natural state. When performed genome-wide, this technique is usually called Hi–C. Inspired by service applications such as Google Maps, we developed NuChart, an R package that integrates Hi–C data to describe the chromosomal neighborhood starting from the information about gene positions, with the possibility of mapping genomic features such as methylation patterns and histone modifications, along with expression profiles, onto the resulting graphs. In this paper we show the importance of the NuChart application for the integration of multi-omic data in a systems biology fashion, with particular interest in cytogenetic applications of these techniques. Moreover, we demonstrate how the integration of multi-omic data can provide useful information in understanding why genes are in certain specific positions inside the nucleus and how epigenetic patterns correlate with their expression.
multi-omic data integration; Chromosome Conformation Capture; gene neighborhood map; chromatin spatial organization; linking gene regulatory elements
Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SV and to study how they are implicated in human diseases. Computational methods based on high-throughput sequencing (HTS) are increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by identifying genomic regions whose read depth differs significantly from that of other regions. The analysis pipeline of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV region identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short-reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.
CNV; GPU; HTS; read-depth; parallel
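The last two operations listed in the abstract above, building the read-depth signal and normalizing it, can be sketched on the CPU in a few lines of Python. This is an illustration only, not G-CNV's GPU code: the fixed-size windows, the median normalization, the 0.5 and 1.5 thresholds, and all function names are assumptions made for the example.

```python
from statistics import median

def read_depth_signal(read_starts, chrom_length, window):
    """Count mapped reads per fixed-size window, keyed by read start position."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < chrom_length:
            counts[start // window] += 1
    return counts

def normalize_by_median(counts):
    """Scale the signal so that the median window has depth 1.0 (assumed scheme)."""
    med = median(counts)
    if med == 0:
        raise ValueError("median depth is zero; cannot normalize")
    return [c / med for c in counts]

def flag_windows(normalized, low=0.5, high=1.5):
    """Label each window with illustrative, hypothetical thresholds."""
    labels = []
    for value in normalized:
        if value >= high:
            labels.append("gain")
        elif value <= low:
            labels.append("loss")
        else:
            labels.append("normal")
    return labels

# Hypothetical start positions of mapped reads on a 400 bp toy chromosome
READ_STARTS = [5, 12, 40, 80,
               110, 120, 130, 140, 150, 160, 170, 180,
               210, 250, 260, 290,
               320, 330, 350, 390]

counts = read_depth_signal(READ_STARTS, chrom_length=400, window=100)
normalized = normalize_by_median(counts)
labels = flag_windows(normalized)
```

Each window is independent of the others, which is the property that makes this kind of counting a natural fit for the GPU parallelization the abstract describes.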
Transposable elements (TEs) are abundant in mammalian genomes and appear to have contributed to the evolution of their hosts by providing novel regulatory or coding sequences. We analyzed different regions of long intergenic non-coding RNA (lincRNA) genes in human and mouse genomes to systematically assess the potential contribution of TEs to the evolution of the structure and regulation of expression of lincRNA genes. Introns of lincRNA genes contain the highest percentage of TE-derived sequences (TES), followed by exons and then promoter regions, although the density of TEs is not significantly different between exons and promoters. Higher frequencies of ancient TEs in promoters and exons compared to introns imply that many lincRNA genes emerged before the split of primates and rodents. The content of TES in lincRNA genes is substantially higher than that in protein-coding genes, especially in exons and promoter regions. A significant positive correlation was detected between the content of TEs and the evolutionary rate of lincRNAs, indicating that inserted TEs are preferentially fixed in fast-evolving lincRNA genes. These results are consistent with the “repeat insertion domains of LncRNAs” hypothesis, under which TEs have substantially contributed to the origin, evolution, and, in particular, fast functional diversification of lincRNA genes.
mobile elements; molecular domestication; exaptation; junk DNA; long non-coding RNA; repetitive elements
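The percentage of TE-derived sequence in a region, as compared across introns, exons, and promoters in the abstract above, can be computed as the fraction of bases covered by TE annotations. The Python sketch below is one plausible way to do this, assuming half-open `(start, end)` intervals on a single chromosome; the coordinates and function names are hypothetical and are not taken from the study.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def te_fraction(region, te_intervals):
    """Fraction of the region's bases covered by at least one TE annotation."""
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    # Keep only TEs that overlap the region and clip them to its boundaries
    clipped = [(max(s, r_start), min(e, r_end))
               for s, e in te_intervals
               if s < r_end and e > r_start]
    covered = sum(end - start for start, end in merge_intervals(clipped))
    return covered / (r_end - r_start)

# Hypothetical TE annotations and two hypothetical gene regions
TES = [(100, 300), (250, 400), (900, 1200)]
intron_fraction = te_fraction((0, 1000), TES)
exon_fraction = te_fraction((1000, 1200), TES)
```

Merging before summing matters because TE annotations often overlap, and counting the shared bases twice would inflate the estimated content.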
Hepatitis C virus (HCV) infection is one of the most common chronic infections in the world, and hepatitis associated with HCV infection is a major risk factor for the development of cirrhosis and hepatocellular carcinoma (HCC). The rapidly growing number of known viral-host and host protein-protein interactions is enabling increasingly reliable network-based analyses of viral infection supported by omics data. The study of molecular interaction networks helps to elucidate the mechanistic pathways linking HCV molecular activities and the host response that modulates the stepwise hepatocarcinogenic process from preneoplastic lesions (cirrhosis and dysplasia) to HCC. By simulating the impact of HCV-host molecular interactions throughout the host protein-protein interaction (PPI) network, we ranked the host proteins in relation to their network proximity to viral targets. We observed that the set of proteins in the neighborhood of HCV targets in the host interactome is enriched in key players of the host response to HCV infection. In contrast to the HCV targets themselves, subnetworks of proteins in network proximity to HCV targets are significantly enriched in proteins reported as differentially expressed in preneoplastic and neoplastic liver samples by two independent studies. Using multi-objective optimization, we extracted subnetworks that are simultaneously linked to HCV proteins by “guilt-by-association” and enriched in differentially expressed proteins. These subnetworks contain established, recently proposed, and novel candidate proteins for the regulation of the mechanisms of the liver cell response to chronic HCV infection.
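Ranking host proteins by their network proximity to viral targets can be illustrated with a small Python sketch. The abstract does not specify the proximity measure, so shortest-path distance from the nearest target, computed with a multi-source breadth-first search, is used here as a stand-in assumption; the toy interaction list, protein names, and function name are hypothetical.

```python
from collections import defaultdict, deque

def rank_by_proximity(edges, targets):
    """Rank non-target proteins by shortest-path distance to the nearest target."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Multi-source BFS: every viral target starts at distance 0
    dist = {t: 0 for t in targets}
    queue = deque(sorted(targets))
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    # Proteins not connected to any target are left out of the ranking
    ranked = [(p, d) for p, d in dist.items() if p not in targets]
    return sorted(ranked, key=lambda item: (item[1], item[0]))

# Hypothetical host PPI edges; T1 and T2 play the role of viral targets
EDGES = [("T1", "P1"), ("T1", "P2"), ("P1", "P3"),
         ("P3", "P4"), ("P4", "T2"), ("P5", "P6")]
ranking = rank_by_proximity(EDGES, targets={"T1", "T2"})
```

A diffusion-style score, such as a random walk with restart, would weigh the number of paths as well as their length and may be closer in spirit to simulating the impact of the interactions, but it is not shown here.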
There is an increasing awareness of the pivotal role of noise in biochemical processes and of the effect of molecular crowding on the dynamics of biochemical systems. This awareness has given rise to a strong need for suitable and sophisticated algorithms for the simulation of biological phenomena that take into account both spatial effects and noise. However, the high computational effort characterizing simulation approaches, coupled with the necessity to simulate the models several times to achieve statistically relevant information on the model behaviours, makes such algorithms very time-consuming for studying real systems. So far, different parallelization approaches have been deployed to reduce the computational time required to simulate the temporal dynamics of biochemical systems using stochastic algorithms. In this work we discuss these aspects for the spatial TAU-leaping in crowded compartments (STAUCC) simulator, a voxel-based method for the stochastic simulation of reaction-diffusion processes which relies on the Sτ-DPP algorithm. In particular, we present how the characteristics of the algorithm can be exploited for an effective parallelization on current heterogeneous HPC architectures.
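To make the tau-leaping idea concrete, the following Python sketch applies a basic Poisson tau-leap to a single reaction A → B in one well-mixed compartment. It is deliberately much simpler than Sτ-DPP or STAUCC: there is no diffusion between voxels, no crowding, and the step size `tau` is fixed rather than adaptive. The rate constant, the function names, and the clamping rule are assumptions made for the example.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def tau_leap_decay(a0, rate, tau, steps, seed=0):
    """Simulate A -> B with fixed-step tau-leaping; returns (time, A, B) tuples."""
    rng = random.Random(seed)
    a, b = a0, 0
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        propensity = rate * a                  # mass-action propensity of A -> B
        fired = poisson(propensity * tau, rng) # number of firings in this leap
        fired = min(fired, a)                  # clamp so A never goes negative
        a -= fired
        b += fired
        trajectory.append((i * tau, a, b))
    return trajectory

trajectory = tau_leap_decay(a0=100, rate=0.5, tau=0.1, steps=50)
```

Because statistically meaningful results require many such runs with different seeds, and each run is independent, repeated simulations are an obvious first target for parallel execution, in line with the motivation given above.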
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils, while 5-methylcytosines remain nonreactive. During PCR amplification, 5-methylcytosines are amplified as cytosines, whereas uracils and thymines are amplified as thymines. To detect methylation levels, bisulfite-treated reads must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge, mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of uniquely mapped reads.
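The loss of information caused by bisulfite treatment, and how methylation is still recovered after mapping, can be shown with a toy Python example. It uses the common in-silico trick of converting every C to T in both read and reference before matching; whether GPU-BSM proceeds exactly this way is not stated in the abstract, so this is an assumption. The sketch handles only exact, ungapped matches on the forward strand, and all sequences and function names are hypothetical.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion: every cytosine becomes thymine."""
    return seq.upper().replace("C", "T")

def map_read(read, reference):
    """Exact match of the converted read on the converted reference.

    Returns the first matching position, or -1 if there is none.
    """
    return c_to_t(reference).find(c_to_t(read))

def call_methylation(read, reference, pos):
    """Compare the original read with the original reference at position pos.

    A reference C read as C was protected (methylated); read as T it was converted.
    """
    calls = {}
    segment = reference[pos:pos + len(read)]
    for offset, (ref_base, read_base) in enumerate(zip(segment, read)):
        if ref_base == "C":
            if read_base == "C":
                calls[pos + offset] = True    # methylated
            elif read_base == "T":
                calls[pos + offset] = False   # unmethylated
    return calls

REFERENCE = "AACGTTCGACCT"
READ = "CGTTTGAT"   # reference[2:10] with the Cs at positions 6 and 9 converted

position = map_read(READ, REFERENCE)
calls = call_methylation(READ, REFERENCE, position)
methylation_level = sum(calls.values()) / len(calls)
```

Matching in the three-letter alphabet is what enlarges the search space: after conversion many more reference positions look alike, so real aligners must tolerate mismatches and resolve reads that map to several locations.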