The mountains of data thrusting from the new landscape of modern high-throughput biology are irrevocably changing biomedical research and creating a near-insatiable demand for training in data management and manipulation and data mining and analysis. Among life scientists, from clinicians to environmental researchers, a common theme is the need not just to use, and gain familiarity with, bioinformatics tools and resources but also to understand their underlying fundamental theoretical and practical concepts. Providing bioinformatics training to empower life scientists to handle and analyse their data efficiently, and progress their research, is a challenge across the globe. Delivering good training goes beyond traditional lectures and resource-centric demos, using interactivity, problem-solving exercises and cooperative learning to substantially enhance training quality and learning outcomes. In this context, this article discusses various pragmatic criteria for identifying training needs and learning objectives, for selecting suitable trainees and trainers, for developing and maintaining training skills and evaluating training quality. Adherence to these criteria may help not only to guide course organizers and trainers on the path towards bioinformatics training excellence but, importantly, also to improve the training experience for life scientists.
bioinformatics; training; bioinformatics courses; training life scientists; train the trainers
Summary: We present iAnn, an open source community-driven platform for dissemination of life science events, such as courses, conferences and workshops. iAnn allows automatic visualisation and integration of customised event reports. A central repository lies at the core of the platform: curators add submitted events, and these are subsequently accessed via web services. Thus, once an iAnn widget is incorporated into a website, it permanently shows timely relevant information as if it were native to the remote site. At the same time, announcements submitted to the repository are automatically disseminated to all portals that query the system. To facilitate the visualization of announcements, iAnn provides powerful filtering options and views, integrated in Google Maps and Google Calendar. All iAnn widgets are freely available.
Analysis of DNA copy number alterations and gene expression changes in human samples has been used to find potential target genes in complex diseases. Recent studies have combined these two types of data using different strategies, but have focused on finding gene-based relationships. However, it has been proposed that these data can be used to identify key genomic regions, which may enclose causal genes under the assumption that disease-associated gene expression changes are caused by genomic alterations.
Following this proposal, we undertake a new integrative analysis of genome-wide expression and copy number datasets. The analysis is based on the combined location of both types of signals along the genome. Our approach takes into account the genomic location in the copy number (CN) analysis and also in the gene expression (GE) analysis. To achieve this we apply a segmentation algorithm to both types of data using paired samples. Then, we perform a correlation analysis and a frequency analysis of the gene loci in the segmented CN regions and the segmented GE regions; selecting in both cases the statistically significant loci. In this way, we find CN alterations that show strong correspondence with GE changes. We applied our method to a human dataset of 64 Glioblastoma Multiforme samples finding key loci and hotspots that correspond to major alterations previously described for this type of tumors.
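The correlation step described above can be illustrated with a minimal sketch. It assumes the segmentation has already produced one CN value and one GE value per locus and per paired sample. The function names, the fixed cutoff of 0.8 and the toy values are hypothetical; the abstract selects loci by statistical significance and does not state a cutoff.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlated_loci(cn, ge, threshold=0.8):
    """Return the loci whose segmented CN and GE values correlate across paired samples.

    cn, ge: dicts mapping a locus name to a list of per-sample segment values,
    with samples in the same order in both dicts.
    threshold: hypothetical cutoff used here in place of a significance test.
    """
    return sorted(
        locus
        for locus in cn
        if locus in ge and pearson(cn[locus], ge[locus]) >= threshold
    )


# Toy values invented for illustration only (four paired samples per locus).
cn = {"locusA": [0.1, 0.9, 1.1, -0.2], "locusB": [0.5, 0.4, 0.6, 0.5]}
ge = {"locusA": [0.2, 1.0, 1.3, -0.1], "locusB": [1.0, -1.0, 0.2, 0.3]}
hits = correlated_loci(cn, ge)
print(hits)
```

In this toy input only `locusA` shows CN and GE values that move together across samples, so it is the only locus retained.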
Identification of key altered genomic loci constitutes a first step to find the genes that drive the alteration in a malignant state. These driver genes can be found in regions that show high correlation in copy number alterations and expression changes.
Motivation: Recent developments in experimental methods facilitate increasingly larger signal transduction datasets. Two main approaches can be taken to derive a mathematical model from these data: training a network (obtained, e.g., from literature) to the data, or inferring the network from the data alone. Purely data-driven methods scale up poorly and have limited interpretability, whereas literature-constrained methods cannot deal with incomplete networks.
Results: We present an efficient approach, implemented in the R package CNORfeeder, to integrate literature-constrained and data-driven methods to infer signalling networks from perturbation experiments. Our method extends a given network with links derived from the data via various inference methods, and uses information on physical interactions of proteins to guide and validate the integration of links. We apply CNORfeeder to a network of growth and inflammatory signalling. We obtain a model with superior data fit in the human liver cancer HepG2 and propose potential missing pathways.
Availability: CNORfeeder is in the process of being submitted to Bioconductor and in the meantime available at www.cellnopt.org.
Supplementary data are available at Bioinformatics online.
Funding bodies are increasingly recognizing the need to provide graduates and researchers with access to short intensive courses in a variety of disciplines, in order both to improve the general skills base and to provide solid foundations on which researchers may build their careers. In response to the development of ‘high-throughput biology’, the need for training in the field of bioinformatics, in particular, is seeing a resurgence: it has been defined as a key priority by many institutions and research programmes and is now an important component of many grant proposals. Nevertheless, when it comes to planning and preparing to meet such training needs, tension arises between the reward structures that predominate in the scientific community, which compel individuals to publish or perish, and the time that must be devoted to the design, delivery and maintenance of high-quality training materials. At the same time, there is much relevant teaching material and training expertise available worldwide that, were it properly organized, could be exploited by anyone who needs to provide training or set up a new course. To do this, however, the materials would have to be centralized in a database and clearly tagged in relation to target audiences, learning objectives, etc. Ideally, they would also be peer reviewed, and easily and efficiently accessible for downloading. Here, we present the Bioinformatics Training Network (BTN), a new enterprise that has been initiated to address these needs, and review it with respect to similar initiatives and collections.
Bioinformatics; training; end users; bioinformatics courses; learning bioinformatics
Functional analysis of large sets of genes and proteins is becoming more and more necessary with the increase of experimental biomolecular data at omic scale. Enrichment analysis is by far the most popular available methodology to derive functional implications of sets of cooperating genes. The problem with these techniques lies in the redundancy of the resulting information, which in most cases generates many trivial results that risk masking the key biological events. We present and describe a computational method, called GeneTerm Linker, that filters and links enriched output data, identifying sets of associated genes and terms and producing metagroups of coherent biological significance. The method uses fuzzy reciprocal linkage between genes and terms to unravel their functional convergence and associations. The algorithm is tested with a small set of well-known interacting proteins from yeast and with a large collection of reference sets from three heterogeneous resources: multiprotein complexes (CORUM), cellular pathways (SGD) and human diseases (OMIM). Statistical precision, recall and balanced F-score are calculated, showing robust results even when different levels of random noise are included in the test sets. Although we could not find an equivalent method, we present a comparative analysis with a widely used method that combines enrichment and functional annotation clustering. A web application implementing the proposed method is provided at http://gtlinker.cnb.csic.es.
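As an illustration of the evaluation metrics named above, the following sketch computes precision, recall and the balanced F-score for one predicted gene set against one reference set. The function name and the gene labels are hypothetical, and the sketch does not reproduce how GeneTerm Linker builds its metagroups.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall and balanced F-score of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gene labels: a metagroup recovers 3 of 5 reference genes plus 1 extra gene.
p, r, f = precision_recall_f1({"A", "B", "C", "X"}, {"A", "B", "C", "D", "E"})
print(round(p, 3), round(r, 3), round(f, 3))
```

Here precision is 3/4 and recall is 3/5, so the balanced F-score is 2/3.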
Genome-wide expression studies have developed exponentially in recent years as a result of extensive use of microarray technology. However, expression signals are typically calculated using the assignment of "probesets" to genes, without addressing the problem of "gene" definition or proper consideration of the location of the measuring probes in the context of the currently known genomes and transcriptomes. Moreover, as our knowledge of metazoan genomes improves, the number of both protein-coding and noncoding genes, as well as their associated isoforms, continues to increase. Consequently, there is a need for new databases that combine genomic and transcriptomic information and provide updated mapping of expression probes to current genomic annotations.
GATExplorer (Genomic and Transcriptomic Explorer) is a database and web platform that integrates a gene loci browser with nucleotide level mappings of oligo probes from expression microarrays. It allows interactive exploration of gene loci, transcripts and exons of human, mouse and rat genomes, and shows the specific location of all mappable Affymetrix microarray probes and their respective expression levels in a broad set of biological samples. The web site allows visualization of probes in their genomic context together with any associated protein-coding or noncoding transcripts. In the case of all-exon arrays, this provides a means by which the expression of the individual exons within a gene can be compared, thereby facilitating the identification and analysis of alternatively spliced exons. The application integrates data from four major source databases: Ensembl, RNAdb, Affymetrix and GeneAtlas; and it provides the users with a series of files and packages (R CDFs) to analyze particular query expression datasets. The maps cover both the widely used Affymetrix GeneChip microarrays based on 3' expression (e.g. human HG U133 series) and the all-exon expression microarrays (Gene 1.0 and Exon 1.0).
GATExplorer is an integrated database that combines genomic/transcriptomic visualization with nucleotide-level probe mapping. By considering expression at the nucleotide level rather than the gene level, it shows that the arrays detect expression signals from entities that most researchers do not contemplate or discriminate. This approach provides the means to undertake a higher resolution analysis of microarray data and potentially extract considerably more detailed and biologically accurate information from existing and future microarray experiments.
Transcriptional and functional analysis reveals that the H-Ras and N-Ras isoforms have different roles in the initial phases of the mouse cell cycle
Using oligonucleotide microarrays, we compared transcriptional profiles corresponding to the initial cell cycle stages of mouse fibroblasts lacking the small GTPases H-Ras and/or N-Ras with those of matching, wild-type controls.
Serum-starved wild-type and knockout ras fibroblasts had very similar transcriptional profiles, indicating that H-Ras and N-Ras do not significantly control transcriptional responses to serum deprivation stress. In contrast, genomic disruption of H-ras or N-ras, individually or in combination, determined specific differential gene expression profiles in response to post-starvation stimulation with serum for 1 hour (G0/G1 transition) or 8 hours (mid-G1 progression). The absence of N-Ras caused significantly higher changes than the absence of H-Ras in the wave of transcriptional activation linked to G0/G1 transition. In contrast, the absence of H-Ras affected the profile of the transcriptional wave detected during G1 progression more strongly than did the absence of N-Ras. H-Ras was predominantly functionally associated with growth and proliferation, whereas N-Ras had a closer link to the regulation of development, the cell cycle, immunomodulation and apoptosis. Mechanistic analysis indicated that extracellular signal-regulated kinase (ERK)-dependent activation of signal transducer and activator of transcription 1 (Stat1) mediates the regulatory effect of N-Ras on defense and immunity, whereas the pro-apoptotic effects of N-Ras are mediated through ERK and p38 mitogen-activated protein kinase signaling.
Our observations confirm the notion of an absolute requirement for different peaks of Ras activity during the initial stages of the cell cycle and document the functional specificity of H-Ras and N-Ras during those processes.
DNA microarrays provide rich profiles that are used in cancer prediction by considering the gene expression levels across a collection of related samples. Support Vector Machines (SVM) have been applied to the classification of cancer samples with encouraging results. However, they rely on Euclidean distances that fail to reflect accurately the proximities among sample profiles. Non-Euclidean dissimilarities therefore provide additional information that should be considered to reduce the misclassification errors.
In this paper, we incorporate a linear combination of non-Euclidean dissimilarities into the ν-SVM algorithm. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method helps to reduce the misclassification errors in several human cancer problems.
Analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. However, most studies done at the global “omic” scale do not focus on human samples, and those that do very often include heterogeneous datasets that mix normal with disease-altered samples. Moreover, the technical noise present in genome-wide expression microarrays is another well-reported problem that is often not addressed with robust statistical methods, and the estimation of errors in the data is not provided.
Human genome-wide expression data from a controlled set of normal-healthy tissues is used to build a confident human gene coexpression network avoiding both pathological and technical noise. To achieve this we describe a new method that combines several statistical and computational strategies: robust normalization and expression signal calculation; correlation coefficients obtained by parametric and non-parametric methods; random cross-validations; and estimation of the statistical accuracy and coverage of the data. All these methods provide a series of coexpression datasets where the level of error is measured and can be tuned. To define the errors, the rates of true positives are calculated by assignment to biological pathways. The results provide a confident human gene coexpression network that includes 3327 gene-nodes and 15841 coexpression-links and a comparative analysis shows good improvement over previously published datasets. Further functional analysis of a subset core network, validated by two independent methods, shows coherent biological modules that share common transcription factors. The network reveals a map of coexpression clusters organized in well defined functional constellations. Two major regions in this network correspond to genes involved in nuclear and mitochondrial metabolism and investigations on their functional assignment indicate that more than 60% are house-keeping and essential genes. The network displays new non-described gene associations and it allows the placement in a functional context of some unknown non-assigned genes based on their interactions with known gene families.
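One of the strategies listed above, the combination of parametric and non-parametric correlation coefficients, can be sketched as follows. The sketch assumes that a gene pair is linked only when both Pearson and Spearman coefficients reach a cutoff. The cutoff of 0.9, the function names and the expression values are hypothetical, and the normalization, cross-validation and accuracy estimation steps of the method are not reproduced.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Parametric (Pearson) correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Non-parametric (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_links(expr, cutoff=0.9):
    """Link two genes when both coefficients reach the (hypothetical) cutoff."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if pearson(expr[g1], expr[g2]) >= cutoff
        and spearman(expr[g1], expr[g2]) >= cutoff
    ]


# Toy expression values across five samples, invented for illustration only.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.1, 2.9, 4.2, 5.1],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
links = coexpression_links(expr)
print(links)
```

Requiring agreement between the two coefficients is one simple way to make a link less sensitive to outliers; raising or lowering the cutoff trades coverage against accuracy.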
The identification of stable and reliable human gene to gene coexpression networks is essential to unravel the interactions and functional correlations between human genes at an omic scale. This work contributes to this aim, and we are making available for the scientific community the validated human gene coexpression networks obtained, to allow further analyses on the network or on some specific gene associations.
The data are available free online at http://bioinfow.dep.usal.es/coexpression/.
The 90S preribosomal particle is required for the production of the 18S rRNA from a pre-rRNA precursor. Despite the identification of the protein components of this particle, its mechanism of assembly and structural design remain unknown. In this work, we have combined biochemical studies, proteomic techniques, and bioinformatic analyses to shed light on the rules of assembly of the yeast 90S preribosome. Our results indicate that several protein subcomplexes work as discrete assembly subunits that bind in defined steps to the 35S pre-rRNA. The assembly of the t-UTP subunit is an essential step for the engagement of at least five additional subunits in two separate, and mutually independent, assembly routes. One of these routes leads to the formation of an assembly intermediate composed of the U3 snoRNP, the Pwp2p/UTP-B subunit, and the Mpp10p complex. The other assembly route involves the stepwise binding of Rrp5p and the UTP-C subunit. We also report the use of a bioinformatic approach that provides a model for the topological arrangement of protein components within the fully assembled particle. Together, our data identify the mechanism of assembly of the 90S preribosome and offer novel information about its internal architecture.
The pyridine nucleotide disulfide reductase (PNDR) is a large and heterogeneous protein family divided into two classes (I and II), which reflect the divergent evolution of its characteristic disulfide redox active site. However, not all PNDR members fit into these categories, which suggests the need for further studies to achieve a more comprehensive classification of this complex family.
A workflow to improve the clusterization of protein families based on the array of linear conserved motifs is designed. The method is applied to the PNDR large family finding two main groups, which correspond to PNDR classes I and II. However, two other separate protein clusters, previously classified as class I in most databases, are outgrouped: the peroxide reductases (NAOX, NAPE) and the type II NADH dehydrogenases (NDH-2). In this way, two novel PNDR classes III and IV for NAOX/NAPE and NDH-2 respectively are proposed. By knowledge-driven biochemical and functional data analyses done on the new class IV, a linear array of motifs putatively related to Cu(II)-reductase activity is detected in a specific subset of NDH-2.
The results presented are a novel contribution to the classification of the complex and large PNDR protein family, supporting its reclusterization into four classes. The linear array of motifs detected within the class IV PNDR subfamily could be useful as a signature for a particular subgroup of NDH-2.
Agile Protein Interaction DataAnalyzer (APID) is an interactive bioinformatics web tool developed to integrate and analyze, in a unified and comparative platform, the main currently known information about protein–protein interactions demonstrated by specific small-scale or large-scale experimental methods. At present, the application includes information from five main source databases, gathered in a unified server to explore >35 000 different proteins and 111 000 different proven interactions. The web includes search tools to query and browse the data, allowing selection of the interaction pairs based on calculated parameters that weight and qualify the reliability of each given protein interaction. Such parameters are for the ‘proteins’: connectivity, cluster coefficient, Gene Ontology (GO) functional environment, GO environment enrichment; and for the ‘interactions’: number of methods, GO overlapping, iPfam domain–domain interaction. APID also includes a graphic interactive tool to visualize selected sub-networks and to navigate on them or along the whole interaction network. The application is available open access at .
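Two of the protein-level parameters listed above, connectivity and cluster coefficient, can be illustrated with a minimal sketch. It assumes the standard graph definitions (number of interaction partners, and the fraction of partner pairs that also interact with each other); the abstract does not specify how APID computes them. Function names and protein labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Undirected interaction graph as a dict of neighbour sets (self-interactions skipped)."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connectivity(graph, protein):
    """Number of distinct interaction partners of a protein."""
    return len(graph.get(protein, set()))


def clustering_coefficient(graph, protein):
    """Fraction of pairs of partners of a protein that also interact with each other."""
    neighbours = graph.get(protein, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * linked / (k * (k - 1))


# Hypothetical interaction pairs, invented for illustration only.
graph = build_graph([("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P4")])
print(connectivity(graph, "P1"), clustering_coefficient(graph, "P1"))
```

In this toy network P1 has three partners, and one of the three possible partner pairs (P2 and P3) interacts, giving a coefficient of 1/3.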
In recent years, the biomolecular sciences have been driven forward by overwhelming advances in new biotechnological high-throughput experimental methods and bioinformatic genome-wide computational methods. Such breakthroughs are producing huge amounts of new data that need to be carefully analysed to obtain correct and useful scientific knowledge. One of the fields where this advance has been most intense is the study of the network of ‘protein–protein interactions’, i.e. the ‘interactome’. In this short review we comment on the main data and databases produced in this field in the last 5 years. We also present a rationalized scheme of biological definitions that will be useful for a better understanding and interpretation of ‘what a protein–protein interaction is’ and ‘which types of protein–protein interactions are found in a living cell’. Finally, we comment on some assignments of interactome data to defined types of protein interaction and we present a new bioinformatic tool called APIN (Agile Protein Interaction Network browser), which is in development and will be applied to browsing protein interaction databases.
Patients with chronic lymphocytic leukemia and 13q deletion as their only FISH abnormality could have a different outcome depending on the number of cells displaying this aberration. Thus, cases with a high number of 13q- cells (13q-H) had both shorter overall survival and time to first therapy. The goal of the study was to analyze the genetic profile of 13q-H patients.
Design and Methods:
A total of 102 samples were studied, 32 of which served as a validation cohort and five were healthy donors.
Chronic lymphocytic leukemia patients with higher percentages of 13q- cells (>80%) showed a different level of gene expression as compared to patients with lower percentages (<80%, 13q-L). This deregulation affected genes involved in apoptosis and proliferation (BCR and NFkB signaling), leading to increased proliferation and decreased apoptosis in 13q-H patients. Deregulation of several microRNAs, such as miR-15a, miR-155, miR-29a and miR-223, was also observed in these patients. In addition, our study suggests that the gene expression pattern of 13q-H cases could be similar to that of patients with 11q- or 17p-.
This study provides new evidence regarding the heterogeneity of 13q deletion in chronic lymphocytic leukemia patients, showing that apoptosis, proliferation as well as miRNA regulation are involved in cases with higher percentages of 13q- cells.
Most sporadic colorectal cancer (sCRC) deaths are caused by metastatic dissemination of the primary tumor. New advances in genetic profiling of sCRC suggest that the primary tumor may contain a cell population with metastatic potential. Here we compare the cytogenetic profile of primary tumors from liver metastatic versus non-metastatic sCRC.
We prospectively analyzed the frequency of numerical/structural abnormalities of chromosomes 1, 7, 8, 13, 14, 17, 18, 20, and 22 by iFISH in 58 sCRC patients: 31 with non-metastatic (54%) vs. 27 with metastatic (46%) disease. From a total of 18 probes, significant differences emerged only for the 17p11.2 and 22q11.2 chromosomal regions. Patients with liver metastatic sCRC showed an increased frequency of del(17p11.2) (10% vs. 67%; p < .001) and del(22q11.2) (0% vs. 22%; p = .02) versus non-metastatic cases. Multivariate analysis of prognostic factors for overall survival (OS) showed that the only clinical and cytogenetic parameters that had an independent adverse impact on patient outcome were the presence of del(17p) with a 17p11.2 breakpoint and del(22q11.2). Based on these two cytogenetic variables, patients were classified into three groups: low-risk (no adverse features), intermediate-risk (one adverse feature) and high-risk (two adverse features), with significantly different OS rates at 5 years (p < .001): 92%, 53% and 0%, respectively.
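The three-group classification described above amounts to counting adverse cytogenetic features. The following sketch shows that rule only; the function and parameter names are hypothetical, and it is an illustration of the grouping logic, not a clinical tool.

```python
def risk_group(del_17p11_2: bool, del_22q11_2: bool) -> str:
    """Map the two cytogenetic variables to a risk group by counting adverse features."""
    adverse_features = int(del_17p11_2) + int(del_22q11_2)
    # 0 features -> low, 1 feature -> intermediate, 2 features -> high.
    return ("low", "intermediate", "high")[adverse_features]


print(risk_group(False, False), risk_group(True, False), risk_group(True, True))
```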
Our results unravel the potential implication of del(17p11.2) in sCRC patients with liver metastasis as this cytogenetic alteration appears to be intrinsically related to an increased metastatic potential and a poor outcome, providing additional prognostic information to that associated with other cytogenetic alterations such as del(22q11.2). Additional prospective studies in larger series of patients would be required to confirm the clinical utility of the new prognostic markers identified.
Transgenic expression of the MafB oncogene in haematopoietic stem/progenitor cells induces plasma cell neoplasia reminiscent of human multiple myeloma and suggests DNA methylation as cause of malignant transformation.
Understanding the cellular origin of cancer can help to improve disease prevention and therapeutics. Human plasma cell neoplasias are thought to develop from either differentiated B cells or plasma cells. However, when the expression of Maf oncogenes (associated to human plasma cell neoplasias) is targeted to mouse B cells, the resulting animals fail to reproduce the human disease. Here, to explore early cellular changes that might take place in the development of plasma cell neoplasias, we engineered transgenic mice to express MafB in haematopoietic stem/progenitor cells (HS/PCs). Unexpectedly, we show that plasma cell neoplasias arise in the MafB-transgenic mice. Beyond their clinical resemblance to human disease, these neoplasias highly express genes that are known to be upregulated in human multiple myeloma. Moreover, gene expression profiling revealed that MafB-expressing HS/PCs were more similar to B cells and tumour plasma cells than to any other subset, including wild-type HS/PCs. Consistent with this, genome-scale DNA methylation profiling revealed that MafB imposes an epigenetic program in HS/PCs, and that this program is preserved in mature B cells of MafB-transgenic mice, demonstrating a novel molecular mechanism involved in tumour initiation. Our findings suggest that, mechanistically, the haematopoietic progenitor population can be the target for transformation in MafB-associated plasma cell neoplasias.
cancer therapy; MafB; multiple myeloma mouse model; oncogenes; reprogramming stem cells
Interactome networks represent sets of possible physical interactions between proteins. They lack spatio-temporal information by construction. However, the specialized functions of the differentiated cell types which are assembled into tissues or organs depend on the combinatorial arrangements of proteins and their physical interactions. Is tissue-specificity, therefore, encoded within the interactome? In order to address this question, we combined protein-protein interactions, expression data, functional annotations and interactome topology. We first identified a subnetwork formed exclusively of proteins whose interactions were observed in all tested tissues. These are mainly involved in housekeeping functions and are located at the topological center of the interactome. This ‘Largest Common Interactome Network’ represents a ‘functional interactome core’. Interestingly, two types of tissue-specific interactions are distinguished when considering function and network topology: tissue-specific interactions involved in regulatory and developmental functions are central whereas tissue-specific interactions involved in organ physiological functions are peripheral. Overall, the functional organization of the human interactome reflects several integrative levels of functions with housekeeping and regulatory tissue-specific functions at the center and physiological tissue-specific functions at the periphery. This gradient of functions recapitulates the organization of organs, from cells to organs. Given that several gradients have already been identified across interactomes, we propose that gradients may represent a general principle of protein-protein interaction network organization.
For years, the genetics of metastatic colorectal cancer (CRC) have been studied using a variety of techniques. However, most of the approaches employed so far have a relatively limited resolution which hampers detailed characterization of the common recurrent chromosomal breakpoints as well as the identification of small regions carrying genetic changes and the genes involved in them.
Here we applied 500K SNP arrays to map the most common chromosomal lesions present at diagnosis in a series of 23 primary tumours from sporadic CRC patients who had developed liver metastasis. Overall, our results confirm that the genetic profile of metastatic CRC is defined by imbalanced gains of chromosomes 7, 8q, 11q, 13q, 20q and X together with losses of the 1p, 8p, 17p and 18q chromosome regions. In addition, SNP-array studies allowed the identification of small (<1.3 Mb) and extensive/large (>1.5 Mb) altered DNA sequences, many of which contain cancer genes known to be involved in CRC and the metastatic process. Detailed characterization of the breakpoint regions for the altered chromosomes showed four recurrent breakpoints at chromosomes 1p12, 8p12, 17p11.2 and 20p12.1; interestingly, the most frequently observed recurrent chromosomal breakpoint was localized at 17p11.2 and systematically targeted the FAM27L gene, whose role in CRC deserves further investigation.
In summary, in the present study we provide a detailed map of the genetic abnormalities of primary tumours from metastatic CRC patients, which confirms and extends previous observations regarding the identification of genes potentially involved in the development of CRC and the metastatic process.