There is an increasing awareness of the pivotal role of noise in biochemical processes and of the effect of molecular crowding on the dynamics of biochemical systems. This awareness has given rise to a strong need for suitable and sophisticated algorithms that simulate biological phenomena while taking into account both spatial effects and noise. However, the high computational effort characterizing simulation approaches, coupled with the necessity of simulating each model several times to achieve statistically relevant information on its behaviour, makes such algorithms very time-consuming for studying real systems. So far, different parallelization approaches have been deployed to reduce the computational time required to simulate the temporal dynamics of biochemical systems using stochastic algorithms. In this work we discuss these aspects for the spatial TAU-leaping in crowded compartments (STAUCC) simulator, a voxel-based method for the stochastic simulation of reaction-diffusion processes which relies on the Sτ-DPP algorithm. In particular, we present how the characteristics of the algorithm can be exploited for an effective parallelization on present-day heterogeneous HPC architectures.
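As a purely illustrative aside, the basic tau-leaping update this family of simulators builds on can be sketched in a few lines of Python. This is generic tau-leaping, not the Sτ-DPP/STAUCC algorithm itself (which adds voxels, diffusion and crowding); the decay reaction and its rate constant are invented for the example.

```python
import math
import random

def poisson(lam, rng):
    # Knuth's multiplicative method; adequate for the modest rates used here
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def tau_leap_step(state, reactions, tau, rng):
    """One tau-leaping update: fire each reaction channel a Poisson-
    distributed number of times over the interval tau, then apply the
    accumulated stoichiometric change to the state vector."""
    new_state = list(state)
    for stoich, propensity in reactions:
        k = poisson(propensity(state) * tau, rng)  # firings in [t, t + tau)
        for i, change in enumerate(stoich):
            new_state[i] += k * change
    return new_state

# Toy decay reaction A -> 0 with mass-action propensity c * A
rng = random.Random(42)
reactions = [([-1], lambda s: 0.1 * s[0])]
state = [1000]
for _ in range(10):
    state = tau_leap_step(state, reactions, 0.1, rng)
```

Leaping trades the exactness of Gillespie's SSA for speed: many reaction firings are aggregated into a single Poisson draw per channel per step, which is also what makes the per-voxel work regular enough to parallelize well.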
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils, while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines are amplified as thymine. To detect methylation levels, bisulfite-treated reads must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge, mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units (GPUs), hardware accelerators that are increasingly and successfully being used to accelerate general-purpose scientific applications. GPU-BSM maps bisulfite-treated reads from both whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and estimates methylation levels. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of uniquely mapped reads.
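The conversion logic the mapper has to invert is itself simple; a minimal sketch of in-silico bisulfite conversion follows (the function name and the set-of-positions representation of methylation are conveniences of this sketch, not GPU-BSM's API).

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """In-silico bisulfite conversion: after treatment plus PCR, every
    cytosine reads as thymine unless it was a protected 5-methylcytosine
    (marked here by its 0-based position in the sequence)."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

# Fully unmethylated: all Cs become Ts, shrinking the usable alphabet --
# the source of the increased search space and information loss.
converted = bisulfite_convert("ACGCT")                  # "ATGTT"
# Methylated C at position 1 survives conversion:
protected = bisulfite_convert("ACGCT", frozenset({1}))  # "ACGTT"
```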
Single Nucleotide Polymorphism (SNP) genotyping analysis is very susceptible to errors in SNP chromosomal positions. As is known, SNP mapping data are provided along with the SNP arrays, without the information necessary to assess their accuracy in advance. Moreover, these mapping data refer to a given build of a genome and need to be updated when a new build becomes available. As a consequence, researchers often plan to remap SNPs with the aim of obtaining more up-to-date chromosomal positions. In this work, we present G-SNPM, a GPU (Graphics Processing Unit)-based tool to map SNPs on a genome.
G-SNPM is a tool that maps a short sequence representative of a SNP against a reference DNA sequence in order to find the physical position of the SNP in that sequence. In G-SNPM each SNP is mapped on its related chromosome by means of an automatic three-stage pipeline. In the first stage, G-SNPM uses the GPU-based short-read mapping tool SOAP3-dp to align, in parallel, the sequences representative of each SNP against the reference chromosome. In the second stage, G-SNPM uses another short-read mapping tool to remap the sequences left unaligned or ambiguously aligned by SOAP3-dp (in this stage SHRiMP2 is used, which exploits specialized vector computing hardware to speed up the Smith-Waterman dynamic programming algorithm). In the last stage, G-SNPM analyzes the alignments obtained by SOAP3-dp and SHRiMP2 to identify the absolute position of each SNP.
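The three stages read as a simple fall-through pipeline. The sketch below mirrors that control flow with stand-in aligner callables (`align_fast` playing the role of SOAP3-dp, `align_sensitive` that of SHRiMP2); since the real tools are external programs, these names and their return shapes are assumptions of the sketch.

```python
def map_snps(snp_flanks, align_fast, align_sensitive):
    """Three-stage mapping in the style of the G-SNPM pipeline:
    stage 1 runs a fast aligner on every flanking sequence, stage 2
    re-runs the unaligned/ambiguous leftovers through a sensitive
    aligner, stage 3 derives each SNP's absolute position.
    `snp_flanks` maps SNP id -> (flanking sequence, SNP offset in it);
    each aligner returns a list of (chromosome, start) hits."""
    positions, leftovers = {}, {}
    for snp_id, (flank, offset) in snp_flanks.items():
        hits = align_fast(flank)
        if len(hits) != 1:                      # unaligned or ambiguous
            leftovers[snp_id] = (flank, offset)
            continue
        chrom, start = hits[0]
        positions[snp_id] = (chrom, start + offset)  # absolute SNP position
    for snp_id, (flank, offset) in leftovers.items():
        hits = align_sensitive(flank)
        if len(hits) == 1:
            chrom, start = hits[0]
            positions[snp_id] = (chrom, start + offset)
    return positions
```

The key design point is that the sensitive (and slower) aligner only ever sees the small residue that the fast GPU aligner could not place unambiguously.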
Results and conclusions
To assess G-SNPM, we used it to remap the SNPs of some commercial chips. Experimental results show that G-SNPM was able to remap almost all SNPs without ambiguity. Based on modern GPUs, G-SNPM provides fast mappings without worsening the accuracy of the results. G-SNPM can be used in specialized Genome Wide Association Studies (GWAS), as well as in annotation tasks that require updating the mapping of SNP probes.
In the last decades, many researchers and clinicians working in the tissue engineering field have published works on the possibility of inducing tissue regeneration guided by the use of biomaterials. To this aim, different scaffolds have been proposed, and their effectiveness tested through in vitro and/or in vivo experiments. In this context, integration and meta-analysis approaches are gaining importance for the analysis and reuse of data such as, for example, those concerning bone and cartilage biomarkers, the biomolecular factors intervening in cell differentiation and growth, the morphology and biomechanical performance of a neo-formed tissue, and, in general, the ability of scaffolds to promote tissue regeneration. Therefore, standards and ontologies are becoming crucial to provide a unifying knowledge framework for annotating data and supporting the semantic integration and unambiguous interpretation of novel experimental results.
In this paper a conceptual framework has been designed for the bone/cartilage tissue engineering domain, which until now has completely lacked standardized methods. A set of guidelines has been provided, defining the minimum information set necessary for describing an experimental study in the bone and cartilage regenerative medicine field. In addition, a Bone/Cartilage Tissue Engineering Ontology (BCTEO) has been developed to provide a representation of the domain's concepts, specifically oriented to cells and to the chemical composition, morphology, and physical characterization of the biomaterials involved in bone/cartilage tissue engineering research.
Considering that tissue engineering is a discipline that traverses different semantic fields and employs many data types, the proposed instruments represent a first attempt to standardize the domain knowledge and can provide a suitable means to integrate data across the field.
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the spatial organization of chromosomes. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in BED format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus providing the possibility of identifying novel biomarkers.
NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart.
Cloud computing opens new perspectives for small-medium biotechnology laboratories that need to perform bioinformatics analysis in a flexible and effective way. This seems particularly true for hybrid clouds that couple the scalability offered by general-purpose public clouds with the greater control and ad hoc customizations supplied by the private ones. A hybrid cloud broker, acting as an intermediary between users and public providers, can support customers in the selection of the most suitable offers, optionally adding the provisioning of dedicated services with higher levels of quality. This paper analyses some economic and practical aspects of exploiting cloud computing in a real research scenario for the in silico drug discovery in terms of requirements, costs, and computational load based on the number of expected users. In particular, our work is aimed at supporting both the researchers and the cloud broker delivering an IaaS cloud infrastructure for biotechnology laboratories exposing different levels of nonfunctional requirements.
Galactose-1-phosphate uridylyltransferase (GALT) catalyses the conversion of galactose-1-phosphate to UDP-galactose, a key step in galactose metabolism. Deficiency of GALT activity in humans, caused by deleterious variations in the GALT gene, can cause a potentially lethal disease called Classic Galactosemia. In this study, we selected 14 novel nucleotide sequence changes in the GALT gene found in galactosemic patients for expression analysis and molecular modeling. Several variants showed decreased levels of expression and decreased abundance in the soluble fraction of the Escherichia coli cell extracts, suggesting altered stability and solubility. Only six variant GALT enzymes had detectable enzymatic activities. Kinetic studies showed that their Vmax decreased significantly. To further characterize the variants at the molecular level, we performed static and dynamic molecular modeling studies. Effects of variations on local and/or global structural features of the enzyme were anticipated for the majority of variants. In-depth studies with molecular dynamics simulations on selected variants predicted the alteration of the protein structure even when static models apparently did not highlight any perturbation. Overall, these studies offer new insights into the molecular properties of the GALT enzyme, with the aim of correlating them with the clinical outcome.
Galactosemia; GALT; Heterologous expression; Molecular modeling
The capability of correlating specific genotypes with human diseases is a complex issue, despite all the advantages arising from high-throughput technologies such as Genome Wide Association Studies (GWAS). New tools for interpreting genetic variants and for prioritizing Single Nucleotide Polymorphisms (SNPs) are therefore needed. Given a list of the most relevant SNPs statistically associated with a specific pathology as the result of a genotyping study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus research on their effects.
We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated with pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on a SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows users to input a list of genes, SNPs or a biological process, and to customize the feature set with relative weights. As a result, SNPranker 2.0 returns a list of SNPs, localized within the input and ontologically enriched genes, combined with their prioritization scores.
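A weighted scoring function of this shape can be sketched as follows; the feature names, the normalization to [0, 1] and the example weights are illustrative assumptions, not SNPranker's actual feature set or defaults.

```python
def snp_score(features, weights):
    """Weighted SNP prioritization score: each annotation feature
    contributes its normalized value (assumed in [0, 1]) times a
    user-adjustable weight; missing features contribute zero."""
    total_w = sum(weights.values())
    return sum(features.get(f, 0.0) * w for f, w in weights.items()) / total_w

# Hypothetical features and weights for two SNPs
weights = {"conservation": 2.0, "regulatory_region": 1.0, "coding_effect": 3.0}
snps = {
    "rs0001": {"conservation": 1.0, "coding_effect": 1.0},
    "rs0002": {"regulatory_region": 0.5},
}
ranked = sorted(snps, key=lambda s: snp_score(snps[s], weights), reverse=True)
```

Because the weights are an explicit input, the same machinery serves both the default machine-learned weighting and any user-customized re-scoring.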
Different databases and resources are already available for SNP annotation, but they do not prioritize or re-score SNPs relying on a priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new data mining and interpretation tool able to support SNP analysis. Possible scenarios are GWAS data re-scoring, SNP selection for custom genotyping arrays, and SNP/disease association studies.
Investigating ligand-regulated allosteric coupling between protein domains is fundamental to understand cell-life regulation. The Hsp70 family of chaperones represents an example of proteins in which ATP binding and hydrolysis at the Nucleotide Binding Domain (NBD) modulate substrate recognition at the Substrate Binding Domain (SBD). Herein, a comparative analysis of an allosteric (Hsp70-DnaK) and a non-allosteric structural homolog (Hsp110-Sse1) of the Hsp70 family is carried out through molecular dynamics simulations, starting from different conformations and ligand-states. Analysis of ligand-dependent modulation of internal fluctuations and local deformation patterns highlights the structural and dynamical changes occurring at residue level upon ATP-ADP exchange, which are connected to the conformational transition between closed and open structures. By identifying the dynamically responsive protein regions and specific cross-domain hydrogen-bonding patterns that differentiate Hsp70 from Hsp110 as a function of the nucleotide, we propose a molecular mechanism for the allosteric signal propagation of the ATP-encoded conformational signal.
Allostery, or the capability of proteins to respond to ligand binding events with a variation in structure or dynamics at a distant site, is a common feature of biomolecular function and regulation in a large number of proteins. Intra-protein connections and inter-residue coordination underlie allosteric mechanisms and react to binding primarily through a finely tuned modulation of motions and structures at the microscopic scale. Hence, all-atom molecular dynamics simulations are suitable to investigate the molecular basis of allostery. Moreover, understanding intra-protein communication pathways at atomistic resolution offers unique opportunities in rational drug design. Proteins of the Hsp70 family are allosteric molecular chaperones involved in maintaining cellular protein homeostasis. These proteins are involved in several types of cancer, neurodegenerative diseases, aging and infections, and are therefore pharmaceutically relevant targets. In this work we have analyzed, by multiple molecular dynamics simulations, the long-range dynamical and conformational effects of ligands bound to Hsp70, and found relevant differences in comparison to the known non-allosteric structural homolog Hsp110. The resulting model of the mechanism of allosteric propagation offers the opportunity of identifying on-pathway allosteric druggable sites, which we propose could guide rational drug-design efforts targeting Hsp70.
Signal transduction and gene regulation determine a major reorganization of metabolic activities in order to support cell proliferation. Protein Kinase B (PKB), also known as Akt, participates in the PI3K/Akt/mTOR pathway, a master regulator of aerobic glycolysis and cellular biosynthesis, two activities shown by both normal and cancer proliferating cells. Not surprisingly considering its relevance for cellular metabolism, Akt/PKB is often found hyperactive in cancer cells. In the last decade, many efforts have been made to improve the understanding of the control of glucose metabolism and the identification of a therapeutic window between proliferating cancer cells and proliferating normal cells. In this context, we have modeled the link between the PI3K/Akt/mTOR pathway, glycolysis, lactic acid production, and nucleotide biosynthesis. We used a computational model to compare two metabolic states generated by two different levels of signaling through the PI3K/Akt/mTOR pathway: one of the two states represents the metabolism of a growing cancer cell characterized by aerobic glycolysis and cellular biosynthesis, while the other state represents the same metabolic network with a reduced glycolytic rate and a higher mitochondrial pyruvate metabolism. Biochemical reactions that link glycolysis and pentose phosphate pathway revealed their importance for controlling the dynamics of cancer glucose metabolism.
PI3K/Akt/mTOR pathway; metabolism; kinetic models; glycolysis; cancer
The world has changed widely in terms of communicating, acquiring, and storing information. Hundreds of millions of people are involved in information retrieval tasks on a daily basis, in particular while using a Web search engine or searching their e-mail, making information retrieval the dominant form of information access, overtaking traditional database-style searching. How to handle this huge amount of information has now become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we present a survey of the main works on literature retrieval and mining in bioinformatics. While claiming that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches applied therein.
Tissue MicroArray technology makes it possible to perform immunohistochemical staining on hundreds of different tissue samples simultaneously, allowing faster analysis and considerably reducing staining costs. A time-consuming phase of the methodology is the selection of tissue areas within paraffin blocks: no utilities have been developed for the identification of areas to be punched from the donor block and assembled in the recipient block.
The presented work supports, in the specific case of a primary subtype of breast cancer (tubular breast cancer), the semi-automatic discrimination between normal and pathological regions within the tissues and their localization. The diagnosis is performed by analysing specific morphological features of the sample, such as the absence of a double layer of cells around the lumen and the decay of a regular glands-and-lobules structure. These features are analysed using an algorithm which extracts morphological parameters from images and compares them to experimentally validated threshold values. Results are satisfactory, since in most cases the automatic diagnosis matches the response of the pathologists. In particular, on a total of 1296 sub-images showing normal and pathological areas of breast specimens, the algorithm's accuracy, sensitivity and specificity are 89%, 84% and 94%, respectively.
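The comparison step can be sketched as a simple threshold check per morphological parameter; the parameter names and threshold values below are illustrative assumptions, not the experimentally validated ones from the study.

```python
def classify_region(params, thresholds):
    """Threshold-based discrimination between normal and pathological
    tissue regions: a region is flagged pathological when any measured
    morphological parameter falls outside its validated (low, high) range."""
    flags = {name: not (low <= params[name] <= high)
             for name, (low, high) in thresholds.items()}
    return ("pathological" if any(flags.values()) else "normal"), flags

# Hypothetical parameters for one image region: a missing second cell
# layer around the lumen trips the first threshold.
thresholds = {"lumen_cell_layers": (2, 2), "gland_regularity": (0.7, 1.0)}
label, flags = classify_region(
    {"lumen_cell_layers": 1, "gland_regularity": 0.9}, thresholds)
```

Returning the per-parameter flags alongside the label keeps the decision explainable, which matters when a pathologist reviews the automatic diagnosis.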
The proposed work is a first attempt to demonstrate that automation in the Tissue MicroArray field is feasible and it can represent an important tool for scientists to cope with this high-throughput technique.
Breast cancer is one of the most common cancer types. Due to the complexity of this disease, it is important to face its study with an integrated and multilevel approach, from genes, transcripts and proteins to molecular networks, cell populations and tissues. According to the systems biology perspective, the biological functions arise from complex networks: in this context, concepts like molecular pathways, protein-protein interactions (PPIs), mathematical models and ontologies play an important role for dissecting such complexity.
In this work we present the Genes-to-Systems Breast Cancer (G2SBC) Database, a resource which integrates data about genes, transcripts and proteins reported in the literature as altered in breast cancer cells. Besides data integration, we provide an ontology-based query system and analysis tools related to intracellular pathways, PPIs, protein structure and systems modelling, in order to facilitate the study of breast cancer from a multilevel perspective. The resource is available at the URL http://www.itb.cnr.it/breastcancer.
The G2SBC Database represents a systems biology oriented data integration approach devoted to breast cancer. By means of the analysis capabilities provided by the web interface, it is possible to overcome the limits of reductionist resources, enabling predictions that can lead to new experiments.
Recombination signal sequences (RSSs) flanking V, D and J gene segments are recognized and cut by the VDJ recombinase during the development of B and T lymphocytes. All RSSs are composed of seven conserved nucleotides, followed by a spacer (containing either 12 ± 1 or 23 ± 1 poorly conserved nucleotides) and a conserved nonamer. Errors in V(D)J recombination, including cleavage of cryptic RSSs outside the immunoglobulin and T cell receptor loci, are associated with oncogenic translocations observed in some lymphoid malignancies. We present in this paper the RSSsite web server, which is available at the address http://www.itb.cnr.it/rss. RSSsite consists of a web-accessible database, RSSdb, for the identification of pre-computed potential RSSs, and of the related search tool, DnaGrab, which allows the scoring of potential RSSs in user-supplied sequences. The latter algorithm makes use of probability models, which can be recast as Bayesian networks, that take into account correlations between groups of positions of a sequence and were developed starting from specific reference sets of RSSs. In validation laboratory experiments, we selected 33 predicted cryptic RSSs (cRSSs) from 11 chromosomal regions outside the immunoglobulin and TCR loci for functional testing.
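A far simpler, independence-assuming cousin of these models is a log-odds position weight matrix built from a reference set of aligned signal sequences; the sketch below illustrates only the basic scoring idea (DnaGrab's models additionally capture correlations between groups of positions, which a plain PWM cannot). The tiny reference set is invented for the example.

```python
import math

def build_pwm(reference_seqs, alphabet="ACGT", pseudo=1.0):
    """Log-odds position weight matrix from a reference set of aligned
    signal sequences, with pseudocounts and a uniform 0.25 background."""
    length = len(reference_seqs[0])
    counts = [{b: pseudo for b in alphabet} for _ in range(length)]
    for seq in reference_seqs:
        for i, base in enumerate(seq):
            counts[i][base] += 1
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: math.log((col[b] / total) / 0.25) for b in alphabet})
    return pwm

def score(seq, pwm):
    """Sum of per-position log-odds: positive means more RSS-like than
    the background model."""
    return sum(col[base] for col, base in zip(pwm, seq))

# Toy heptamer reference set centred on the canonical CACAGTG
refs = ["CACAGTG", "CACAGTG", "CACTGTG"]
pwm = build_pwm(refs)
```

Sliding such a score along a chromosome and thresholding it is the essence of flagging candidate cryptic RSSs for functional testing.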
The identification of the organisation and dynamics of molecular pathways is crucial for the understanding of cell function. In order to reconstruct the molecular pathways in which a gene of interest is involved, it is important to identify the set of genes with which it interacts to determine cell function. In this context, the mining and integration of the large amount of publicly available data regarding the transcriptome and proteome states of a cell are a useful resource to complement biological research.
We describe an approach for the identification of genes that interact with each other to regulate cell function. The strategy relies on the analysis of gene expression profile similarity over large datasets of expression data. During the similarity evaluation, the methodology determines the most significant subset of samples in which the evaluated genes are highly correlated. Hence, the strategy enables the exclusion of samples that are not relevant for each gene pair analysed. This feature is important when considering a large set of samples characterised by heterogeneous experimental conditions, where different pools of biological processes can be active across the samples. The putative partners of the studied gene are then further characterised by analysing the distribution of Gene Ontology terms and integrating protein-protein interaction (PPI) data. The strategy was applied to the analysis of the functional relationships of a gene of known function, Pyruvate Kinase, and to the prediction of functional partners of the human transcription factor TBX3. In both cases the analysis was performed on a dataset composed of breast primary tumour expression data derived from the literature. Integration and analysis of PPI data confirmed the predictions of the methodology, since the genes identified as functionally related were associated with proteins close together in the PPI network. Two genes among the predicted putative partners of TBX3 (GLI3 and GATA3) were confirmed by in vivo binding assays (crosslinking immunoprecipitation, X-ChIP) in which the putative DNA enhancer sequence sites of GATA3 and GLI3 were found to be bound by the Tbx3 protein.
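The idea of scoring a gene pair only on its most coherent samples can be sketched with a greedy sample-dropping loop around a plain Pearson correlation; this is an illustration of the concept, not the paper's actual subset-selection procedure or significance criterion.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def best_subset_correlation(xs, ys, min_samples):
    """Greedy sketch: repeatedly drop the single sample whose removal
    raises the pair's correlation the most, stopping when no drop helps
    or when only `min_samples` samples remain."""
    idx = list(range(len(xs)))
    best = pearson(xs, ys)
    while len(idx) > min_samples:
        trials = []
        for j in idx:
            keep = [i for i in idx if i != j]
            r = pearson([xs[i] for i in keep], [ys[i] for i in keep])
            trials.append((r, j))
        r, drop = max(trials)
        if r <= best:
            break
        best, idx = r, [i for i in idx if i != drop]
    return best, idx
```

With heterogeneous samples, a single condition in which only one of the two genes responds can mask an otherwise clean co-expression signal; excluding it recovers the relationship.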
The presented strategy is demonstrated to be an effective approach to identify genes that establish functional relationships. The methodology identifies and characterises genes with a similar expression profile, through data mining and integrating data from publicly available resources, to contribute to a better understanding of gene regulation and cell function. The prediction of the TBX3 target genes GLI3 and GATA3 was experimentally confirmed.
The design of mutants in protein functional regions, such as ligand binding sites, is a powerful approach to recognize the determinants of specific protein activities in cellular pathways. For an exhaustive analysis of selected positions of a protein structure, large-scale mutagenesis techniques are often employed, with laborious and time-consuming experimental set-ups. 'In silico' mutagenesis and screening simulation represents a valid alternative to laboratory methods to drive the 'in vivo' testing toward more focused objectives.
We present here a high-performance computational procedure for large-scale mutant modelling and subsequent evaluation of the effect on ligand binding affinity. The mutagenesis was performed with a 'saturation' approach, where all 20 natural amino acids were tested in positions involved in ligand binding sites. Each modelled mutant was subjected to molecular docking simulation and stability evaluation. The simulated protein-ligand complexes were screened for their impairment of binding ability based on the change in calculated Ki compared to the wild-type.
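The outer screening loop can be sketched as follows; `predict_ki` and `is_stable` are stand-ins for the docking and stability steps, and the fold-change cutoff is an invented example threshold, not the study's criterion.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_screen(positions, wild_type_ki, predict_ki, is_stable,
                      fold_change=10.0):
    """Saturation mutagenesis sketch: substitute each binding-site
    position with every non-wild-type amino acid, discard unstable
    mutants, and keep those whose predicted Ki shifts by at least
    `fold_change` in either direction relative to wild type.
    `positions` maps residue number -> wild-type residue."""
    hits = []
    for pos, wt_res in positions.items():
        for aa in AMINO_ACIDS:
            if aa == wt_res:
                continue
            mutant = f"{wt_res}{pos}{aa}"          # e.g. "L42A"
            if not is_stable(mutant):              # stability filter
                continue
            ki = predict_ki(mutant)                # docking-derived Ki
            if ki / wild_type_ki >= fold_change or \
               wild_type_ki / ki >= fold_change:
                hits.append((mutant, ki))
    return hits
```

Since all substitutions at each position are enumerated, no prior knowledge of residue function is needed; the filters alone prune the space down to a short list for experimental testing.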
An example of application to the Endothelial Protein C Receptor residues involved in lipid binding is reported.
The computational pipeline presented in this work is a useful tool for the design of structurally stable mutants with altered affinity for ligand binding, considerably reducing the number of mutants to be experimentally tested. The saturation mutagenesis procedure does not require previous knowledge of functional role of the residues involved and allows extensive exploration of all possible substitutions and their pairwise combinations. Mutants are screened by docking simulation and stability evaluation followed by a rationally driven selection of those presenting the required characteristics. The method can be employed in molecular recognition studies and as a preliminary approach to select models for experimental testing.
The NCBI dbEST currently contains more than eight million human Expressed Sequence Tags (ESTs). This wide collection represents an important source of information for gene expression studies, provided it can be inspected according to biologically relevant criteria. EST data can be browsed using different dedicated web resources, which allow users to investigate library-specific gene expression levels and to make comparisons among libraries, highlighting significant differences in gene expression. Nonetheless, no tool is available to examine the distribution of quantitative EST collections in Gene Ontology (GO) categories, nor to retrieve information concerning library-dependent EST involvement in metabolic pathways. In this work we present the Human EST Ontology Explorer (HEOE), a web facility for the comparison of expression levels among libraries from several healthy and diseased tissues.
The HEOE provides library-dependent statistics on the distribution of sequences in the GO Directed Acyclic Graph (DAG) that can be browsed at each GO hierarchical level. The tool is based on large-scale BLAST annotation of EST sequences. Due to the huge number of input sequences, this BLAST analysis was performed with the aid of grid computing technology, which is particularly suitable for data-parallel tasks. Relying on the achieved annotation, library-specific distributions of ESTs in the GO graph were inferred. A pathway-based search interface was also implemented, for a quick evaluation of the representation of libraries in metabolic pathways. EST processing steps were integrated into a semi-automatic procedure that relies on Perl scripts and stores results in a MySQL database. A PHP-based web interface offers the possibility to simultaneously visualize, retrieve and compare data from the different libraries. Statistically significant differences in GO categories among user-selected libraries can also be computed.
The HEOE provides an alternative and complementary way to inspect EST expression levels with respect to approaches currently offered by other resources. Furthermore, BLAST computation on the whole human EST dataset was a suitable test of grid scalability in the context of large-scale bioinformatics analysis. The HEOE currently comprises sequence analysis from 70 non-normalized libraries, representing a comprehensive overview on healthy and unhealthy tissues. As the analysis procedure can be easily applied to other libraries, the number of represented tissues is intended to increase.
The cell cycle is a complex process that allows eukaryotic cells to replicate chromosomal DNA and partition it into two daughter cells. A relevant regulatory step is in the G0/G1 phase, a point called the restriction (R) point where intracellular and extracellular signals are monitored and integrated.
Subcellular localization of cell cycle proteins is increasingly recognized as a major factor that regulates cell cycle transitions. Nevertheless, current mathematical models of the G1/S networks of mammalian cells do not consider this aspect. Hence, there is a need for a computational model that incorporates this regulatory aspect that has a relevant role in cancer, since altered localization of key cell cycle players, notably of inhibitors of cyclin-dependent kinases, has been reported to occur in neoplastic cells and to be linked to cancer aggressiveness.
The network of the model components involved in the G1 to S transition process was identified through literature and web-based data mining, and the corresponding wiring diagram of the G1 to S transition was drawn with CellDesigner notation. The model has been implemented in Mathematica using Ordinary Differential Equations. Time courses of the level and sub-cellular localization of key cell cycle players in mouse fibroblasts re-entering the cell cycle after serum starvation/re-feeding have been used to constrain network design and parameter determination. The model recapitulates events from growth factor stimulation to the onset of S phase. The R point estimated by simulation is consistent with the experimentally determined R point.
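The numerical core of such an ODE model can be illustrated with a minimal fixed-step integrator applied to a toy two-compartment shuttle (cytoplasmic and nuclear pools of a complex); this stands in for the Mathematica solver, and the species and rate constants are invented, not the model's parameters.

```python
def euler(deriv, y0, t_end, dt):
    """Fixed-step Euler integration of dy/dt = deriv(t, y); a minimal
    stand-in for the ODE solver used to simulate the network."""
    t, y = 0.0, list(y0)
    while t < t_end:
        dy = deriv(t, y)
        y = [yi + dt * di for yi, di in zip(y, dy)]
        t += dt
    return y

def shuttle(t, y, k_in=0.5, k_out=0.1):
    """Toy cytoplasmic/nuclear shuttling of a cyclin/Cdk complex:
    import at rate k_in, export at rate k_out (illustrative rates)."""
    C, N = y
    return [-k_in * C + k_out * N, k_in * C - k_out * N]

# Start with everything cytoplasmic; the nuclear/cytoplasmic ratio
# relaxes toward k_in / k_out = 5 while the total pool is conserved.
final = euler(shuttle, [1.0, 0.0], 50.0, 0.01)
```

In the full model every shuttling species contributes such a pair of equations, coupled to the synthesis, degradation and complex-formation terms of the G1/S network.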
The major element of novelty of our model of the G1 to S transition is the explicit modeling of the cytoplasmic/nuclear shuttling of cyclins, cyclin-dependent kinases, their inhibitors and complexes. Sensitivity analysis of the network performance reveals that the biological effect brought about by Cki overexpression strictly depends on whether the Cki promotes nuclear translocation of cyclin/Cdk-containing complexes.
In this paper we provide an introduction to techniques for studying multi-scale complex biological systems, from the single biomolecule to the cell, combining theoretical modeling, experiments, and informatics tools and technologies suitable for biological and biomedical research, which is becoming increasingly multidisciplinary, multidimensional and information-driven. The most important concepts in mathematical modeling methodologies and statistical inference, bioinformatics, and standards and tools to investigate complex biomedical systems are discussed, and the prominent literature useful to both the practitioner and the theoretician is presented.
Two complete genome sequences are available for Vitis vinifera Pinot noir. Based on the sequence and gene predictions produced by the IASMA, we performed an in silico detection of putative microRNA genes and of their targets, and collected the most reliable microRNA predictions in a web database. The application is available at .
The program FindMiRNA was used to detect putative microRNA genes in the grape genome. A very high number of predictions was retrieved, calling for validation. Nine parameters were calculated and, based on the grape microRNA dataset available at miRBase, thresholds were defined and applied to FindMiRNA predictions having targets in gene exons. In the resulting subset, predictions were ranked according to precursor position and sequence similarity, and to target identity. To further validate the FindMiRNA predictions, comparisons with the Arabidopsis genome, the grape Genoscope genome, and the grape EST collection were performed. Results were stored in a MySQL database and a web interface was prepared to query the database and retrieve predictions of interest.
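The threshold-filtering step can be sketched generically. The parameter names and cutoff values below are invented stand-ins for the nine parameters and the miRBase-derived thresholds described above; only the filtering logic is illustrated.

```python
# Illustrative only: parameter names and thresholds are hypothetical, standing
# in for the nine parameters calibrated against the miRBase grape dataset.
THRESHOLDS = {
    "mfe": ("max", -20.0),        # minimum folding energy: at most -20 kcal/mol
    "gc_content": ("min", 0.30),  # GC fraction: at least 0.30
    "stem_length": ("min", 18),   # paired stem: at least 18 nt
}

def passes(prediction):
    """Keep a prediction only if every parameter satisfies its cutoff."""
    for param, (kind, cutoff) in THRESHOLDS.items():
        value = prediction[param]
        if kind == "min" and value < cutoff:
            return False
        if kind == "max" and value > cutoff:
            return False
    return True

preds = [
    {"id": "p1", "mfe": -35.2, "gc_content": 0.42, "stem_length": 21},
    {"id": "p2", "mfe": -12.0, "gc_content": 0.55, "stem_length": 25},
]
kept = [p["id"] for p in preds if passes(p)]  # p2 fails the energy cutoff
```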
The GrapeMiRNA database encompasses 5,778 microRNA predictions spanning the whole grape genome. Predictions are integrated with information that can be of use in selection procedures. Tools added to the web interface also allow users to inspect predictions according to gene ontology classes and the metabolic pathways of their targets. The GrapeMiRNA database can be of help in selecting candidate microRNA genes to be validated.
Grid technology is the computing model that allows users to share a wide plethora of distributed computational resources regardless of their geographical location. Up to now, the strict security policy required to access distributed computing resources has been a rather big limiting factor when trying to broaden the usage of Grids to a wide community of users. Grid security is indeed based on the Public Key Infrastructure (PKI) of X.509 certificates, and the procedure to obtain and manage those certificates is unfortunately not straightforward. A first step towards making Grids more appealing for new users has recently been achieved with the adoption of robot certificates.
Robot certificates have recently been introduced to perform automated tasks on Grids on behalf of users. They are extremely useful, for instance, to automate Grid service monitoring, data processing in production, and distributed data collection systems. These certificates can be used to identify a person responsible for an unattended service or process acting as client and/or server. Robot certificates can be installed on a smart card and used behind a portal by everyone interested in running the related applications in a Grid environment through a user-friendly graphical interface. In this work, the GENIUS Grid Portal, powered by EnginFrame, has been extended to support a new authentication mechanism based on the adoption of these robot certificates.
The work carried out and reported in this manuscript is particularly relevant for all users who are not familiar with personal digital certificates and the technical aspects of the Grid Security Infrastructure (GSI). The valuable benefits introduced by robot certificates in e-Science can thus be extended to users belonging to several scientific domains, providing an asset for raising Grid awareness among a wide number of potential users.
The adoption of Grid portals extended with robot certificates can really contribute to creating transparent access to the computational resources of Grid infrastructures, enhancing the spread of this new paradigm in researchers' working lives to address new global scientific challenges. The evaluated solution can of course be extended to other portals, applications and scientific communities.
The Tissue MicroArray technique is becoming increasingly important in pathology for the validation of experimental data from transcriptomic analysis. This approach produces many images which need to be properly managed, ideally with an infrastructure able to support tissue sharing between institutes. Moreover, the available frameworks oriented to Tissue MicroArrays provide good storage for clinical patient, sample treatment and block construction information, but their utility is limited by the lack of integration with biomolecular information.
In this work we propose a web-oriented Tissue MicroArray system that supports researchers in managing bio-samples and, through the use of ontologies, enables tissue sharing aimed at the design of Tissue MicroArray experiments and the evaluation of results. Indeed, our system provides an ontological description both of pre-analysis tissue images and of post-process analysis image results, which is crucial for information exchange. Moreover, by working on well-defined terms it is then possible to query web resources for literature articles to integrate both pathology and bioinformatics data.
Using this system, users associate an ontology-based description with each image uploaded into the database and also integrate results with the ontological description of the biosequences identified in every tissue. Moreover, it is possible to integrate the ontological description provided by the user with a fully compliant Gene Ontology definition, enabling statistical studies of the correlation between the analyzed pathology and the most commonly related biological processes.
Significance analysis at the single gene level may suffer from the limited number of samples and from experimental noise, which can severely limit the power of the chosen statistical test. This problem is typically approached by applying post hoc corrections to control the false discovery rate, without taking prior biological knowledge into account. Pathway or gene ontology analysis can provide an alternative way to relax the significance threshold applied to single genes and may lead to a better biological interpretation.
Here we propose a new analysis method based on the study of networks of pathways. These networks are reconstructed considering both the significance of single pathways (network nodes) and the intersections between them (links).
We apply this method for the reconstruction of networks of pathways to two gene expression datasets: the first one obtained from a c-Myc rat fibroblast cell line expressing a conditional Myc-estrogen receptor oncoprotein; the second one obtained from the comparison of Acute Myeloid Leukemia and Acute Lymphoblastic Leukemia derived from bone marrow samples.
Our method extends statistical models that have been recently adopted for the significance analysis of functional groups of genes to infer links between these groups. We show that groups of genes at the interface between different pathways can be considered as relevant even if the pathways they belong to are not significant by themselves.
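The node/link construction described above can be sketched with standard set operations and an upper-tail hypergeometric enrichment test. This is a simplified illustration, not the authors' statistical model: the significance threshold, the toy gene universe, and the rule "link two pathways if they share at least one gene" are all assumptions made for the example.

```python
# Sketch of a pathway network: nodes are pathways enriched in a significant
# gene list (hypergeometric upper tail), links are non-empty intersections.
from itertools import combinations
from math import comb

def hypergeom_p(k, K, n, N):
    """P(X >= k) when a size-n significant list is drawn from a universe of
    N genes, K of which belong to the pathway."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def pathway_network(pathways, significant, universe, alpha=0.05):
    """pathways: {name: gene set}. Returns (nodes, links)."""
    N, n = len(universe), len(significant)
    nodes = {name: genes for name, genes in pathways.items()
             if hypergeom_p(len(genes & significant), len(genes), n, N) < alpha}
    links = {(a, b): nodes[a] & nodes[b]
             for a, b in combinations(sorted(nodes), 2) if nodes[a] & nodes[b]}
    return nodes, links

# Toy data: pathways A and B overlap; C is disjoint from the significant list.
universe = {f"g{i}" for i in range(20)}
pathways = {"A": {f"g{i}" for i in range(0, 5)},
            "B": {f"g{i}" for i in range(3, 8)},
            "C": {f"g{i}" for i in range(15, 20)}}
significant = {f"g{i}" for i in range(0, 7)}
nodes, links = pathway_network(pathways, significant, universe)
```

Here the genes shared by two node pathways (the link weight) are exactly the interface groups the method inspects, even when neither gene would survive a single-gene threshold.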
Recent progress in genotyping technologies allows the generation of high-density genetic maps using hundreds of thousands of genetic markers for each DNA sample. The availability of this large amount of genotypic data facilitates whole genome searches for the genetic basis of diseases.
We need a suitable information management system to efficiently manage the data flow produced by whole genome genotyping and to make it available for further analyses.
We have developed an information system mainly devoted to the storage and management of SNP genotype data produced by the Illumina platform, loading the raw genotyping outputs into a relational database.
The relational database can be accessed in order to import any existing data and export user-defined formats compatible with many different genetic analysis programs.
After family-based or case-control association study data have been calculated, the results can be imported into SNPLims. One of its main features is to allow the user to rapidly identify and annotate statistically relevant polymorphisms within the large volume of data analyzed. Results can be easily visualized either graphically or by creating ASCII comma-separated output files, which can be used as input for further analyses.
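The import/export round trip through a relational database can be sketched as follows. The real SNPLims schema is not described in the text, so the table layout, column names, and CSV format below are invented for illustration (using SQLite in place of the production database).

```python
# Hypothetical sketch of the relational storage and comma-separated export
# steps; the actual SNPLims schema and formats may differ.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genotype "
             "(sample TEXT, snp TEXT, allele1 TEXT, allele2 TEXT)")
rows = [("S1", "rs123", "A", "G"),
        ("S1", "rs456", "C", "C"),
        ("S2", "rs123", "G", "G")]
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)", rows)

def export_csv(conn):
    """Dump genotypes as ASCII comma-separated lines, one call per row."""
    lines = ["sample,snp,genotype"]
    query = ("SELECT sample, snp, allele1, allele2 FROM genotype "
             "ORDER BY sample, snp")
    for sample, snp, a1, a2 in conn.execute(query):
        lines.append(f"{sample},{snp},{a1}{a2}")
    return "\n".join(lines)

print(export_csv(conn))
```

A format-specific writer of this kind per target analysis program is enough to cover the "export user-defined formats" feature mentioned above.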
The proposed infrastructure makes it possible to manage a relatively large number of genotypes for each sample and an arbitrary number of samples and phenotypes. Moreover, it enables users to control the quality of the data, to perform the most common screening analyses, and to identify genes that become “candidates” for the disease under consideration.
The ESTree database (db) is a collection of Prunus persica and Prunus dulcis EST sequences that in its current version encompasses 75,404 sequences from 3 almond and 19 peach libraries. Nine peach genotypes and four peach tissues are represented, covering four fruit developmental stages. The aim of this work was to extend the existing ESTree db by adding new sequences and analysis programs. Particular care was given to the implementation of the web interface, which allows querying each of the database features.
A modular Perl pipeline is the backbone of sequence analysis in the ESTree db project. Outputs obtained during the pipeline steps are automatically loaded into the fields of a MySQL database. Apart from standard clustering and annotation analyses, version VI of the ESTree db encompasses new tools for tandem repeat identification, for annotation against genomic Rosaceae sequences, and for positioning on the database the oligomer sequences that were used in a peach microarray study. Furthermore, known protein patterns and motifs were identified by comparison with PROSITE. Based on data retrieved from sequence annotation against the UniProtKB database, a script was prepared to track the positions of homologous hits on the GO tree and to build statistics on the distribution of ontologies across GO functional categories. EST mapping data were also integrated in the database. The PHP-based web interface was upgraded and extended. The aim of the authors was to enable querying the database according to all the biological aspects that can be investigated from the data available in the ESTree db. This is achieved by allowing multiple searches on logical subsets of sequences that represent different biological situations or features.
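The GO-statistics step (tracking hits on the GO tree and counting sequences per functional category) amounts to propagating each annotation up to its ancestors before counting. The sketch below uses an invented toy GO fragment and term names; the real script works on the full ontology and is written in Perl, but the logic is the same.

```python
# Hedged sketch of GO-category statistics: propagate each annotated term to
# all its ancestors, then count sequences per term. Term names are invented.
PARENTS = {  # child -> parents (real GO is a DAG; this is a toy fragment)
    "GO:catalytic": ["GO:molecular_function"],
    "GO:kinase": ["GO:catalytic"],
    "GO:binding": ["GO:molecular_function"],
}

def ancestors(term):
    """The term itself plus every ancestor reachable via PARENTS."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

def go_counts(annotations):
    """annotations: {sequence_id: [GO terms]} -> {term: number of sequences}."""
    counts = {}
    for seq, terms in annotations.items():
        hit = set().union(*(ancestors(t) for t in terms))
        for t in hit:
            counts[t] = counts.get(t, 0) + 1
    return counts

stats = go_counts({"est1": ["GO:kinase"],
                   "est2": ["GO:binding", "GO:kinase"]})
```

Counting over the union of ancestors (rather than raw annotations) is what makes the per-category totals consistent: every sequence annotated to a child is also counted under its parents.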
Version VI of the ESTree db offers a broad overview of peach gene expression. The sequence analysis results contained in the database, extensively linked to external related resources, represent a large amount of information that can be queried via the tools offered in the web interface. The flexibility and modularity of the ESTree analysis pipeline and of the web interface allowed the authors to set up similar structures for different datasets with limited manual intervention.