Alveolar soft part sarcoma (ASPS) is a rare, malignant mesenchymal tumor with distinctive clinical, morphologic, ultrastructural, and cytogenetic characteristics. It typically arises in the extremities of adolescents and young adults, but has also been documented in a number of unusual sites, causing diagnostic confusion both clinically and morphologically. The molecular signature of ASPS is a specific der(17)t(X;17)(p11.2;q25) translocation, which fuses the TFE3 transcription factor gene at Xp11.2 with the ASPL gene at 17q25. Recent studies have shown that the ASPL-TFE3 fusion transcript can be identified by reverse-transcriptase polymerase chain reaction analysis, that TFE3 gene rearrangement can be detected in paraffin-embedded tissue using a dual-color, break-apart fluorescence in situ hybridization assay, and that the resultant fusion protein can be detected immunohistochemically with an antibody directed against the carboxy-terminal portion of TFE3. Herein, we report a unique case of ASPS presenting as an asymptomatic lung mass in a 48-year-old woman without evidence of a primary soft tissue tumor elsewhere at the time of initial diagnosis. To the best of our knowledge, this is the third such case reported in the English-language literature to date. We emphasize the differential diagnoses engendered by ASPS, including a series of tumors involving the lung that have nested and alveolar growth patterns and both clear and eosinophilic cytoplasm, and demonstrate the utility of molecular genetic analysis for TFE3 rearrangement and immunohistochemistry for TFE3 antigen expression in arriving at an accurate diagnosis.
Alveolar soft part sarcoma; ASPS; Lung; TFE3; FISH; Differential diagnosis
Colorectal cancer is the second leading cause of cancer-related mortality in men and women in the United States. While colonoscopy offers a definite advantage in screening, it still lacks widespread acceptance among the general public. This is evident from the fact that up to 75% of patients diagnosed with colorectal cancer present with locally advanced disease. To make colonoscopy, and in turn colorectal cancer screening, a patient-friendly and comfortable test, some changes to the instrument are necessary. The conventional colonoscope has changed little since its development. Several new advances in colorectal screening practices have emerged; among the most promising is the advent of robotic endoscopic techniques.
New discoveries in the last decade have significantly altered our view of mitochondria. They are no longer viewed as energy-making slaves but rather as individual cells-within-the-cell. In particular, it has been suggested that many important cellular mechanisms involving specific enzymes and ion channels, such as nitric oxide synthase (NOS), ATP-dependent K+ (KATP) channels, and poly-(ADP-ribose) polymerase (PARP), have distinct mitochondrial variants. Unfortunately, exploring these parallel systems in mitochondria has technical limitations, and inappropriate methods have often led to inconsistent results. For example, the intriguing possibility that mitochondria are significant sources of nitric oxide (NO) via a unique mitochondrial NOS variant has attracted intense interest among research groups because of the potential for NO to affect the functioning of the electron transport chain. Nonetheless, conclusive evidence for the existence of mitochondrial NO synthesis has yet to be presented. This review summarizes the experimental evidence gathered over the last decade in this field and highlights new areas of research that reveal surprising dimensions of NO production and metabolism by mitochondria.
Mitochondrion; Nitric Oxide Synthase; Peroxynitrite; Reactive Nitrogen Species; Nitric Oxide; Electron Transport System
Adverse event (AE) surveillance may be enhanced by the Institute for Healthcare Improvement’s Global Trigger Tool (GTT). A pilot study of the GTT was conducted in one Veterans Health Administration (VA) facility to assess the rates, types, and harm of AEs detected, and to examine the overlap in AE detection between the GTT and existing surveillance mechanisms.
GTT guidelines were followed and medical charts were reviewed for 17 weeks of acute-care hospitalizations. Investigators met monthly, first to adjudicate discordant reviewer categorizations of harm and later to categorize the AEs detected using standardized definitions. GTT-detected AEs were compared with incident reports, Patient Safety Indicators, and the VA Surgical Quality Improvement Program.
Medical charts were reviewed for 273 cases out of 1,980 eligible cases. Using the GTT, a total of 109 AEs were identified. More than 1 out of 5 hospitalizations (21%) were associated with an AE. The majority of AEs detected (60%) were minor harms; there were no deaths attributable to medical care. Ninety-six of the 109 AEs (88%) were not detected by other measures.
The GTT identified previously undetected AEs at one VA. The GTT has the potential to track AEs and guide quality improvement efforts in conjunction with existing AE surveillance mechanisms.
quality measurement; trigger tools; health services research; patient safety; harm; adverse event epidemiology/detection
Clostridium difficile is an anaerobic Gram-positive bacterium that causes intestinal infections with symptoms ranging from mild diarrhea to fulminant colitis. Cyclic diguanosine monophosphate (c-di-GMP) is a bacterial second messenger that typically regulates the switch from motile, free-living to sessile and multicellular behaviors in Gram-negative bacteria. Increased intracellular c-di-GMP concentration in C. difficile was recently shown to reduce flagellar motility and to increase cell aggregation. In this work, we investigated the role of the primary type IV pilus (T4P) locus in c-di-GMP-dependent cell aggregation. Inactivation of two T4P genes, pilA1 (CD3513) and pilB1 (CD3512), abolished pilus formation and significantly reduced cell aggregation under high c-di-GMP conditions. pilA1 is preceded by a putative c-di-GMP riboswitch, predicted to be transcriptionally active upon c-di-GMP binding. Consistent with our prediction, high intracellular c-di-GMP concentration increased transcript levels of T4P genes. In addition, single-round in vitro transcription assays confirmed that transcription downstream of the predicted transcription terminator was dose dependent and specific to c-di-GMP binding to the riboswitch aptamer. These results support a model in which T4P gene transcription is upregulated by c-di-GMP as a result of its binding to an upstream transcriptionally activating riboswitch, promoting cell aggregation in C. difficile.
With the availability of gene expression data by RNA-seq, powerful statistical approaches for grouping similar gene expression profiles across different environments have become increasingly important. We describe and assess a computational model for clustering genes into distinct groups based on the pattern of gene expression in response to changing environment. The model capitalizes on the Poisson distribution to capture the count property of RNA-seq data. A two-stage hierarchical expectation–maximization (EM) algorithm is implemented to estimate an optimal number of groups and the mean expression level of each group across two environments. A procedure is formulated to test whether and how a given group shows a plastic response to environmental changes. The impact of gene–environment interactions on the phenotypic plasticity of the organism can also be visualized and characterized. The model was used to analyse an RNA-seq dataset measured from two cell lines of breast cancer that respond differently to an anti-cancer drug, from which genes associated with the resistance and sensitivity of the cell lines are identified. We performed simulation studies to validate the statistical behaviour of the model. The model provides a useful tool for clustering gene expression data by RNA-seq, facilitating our understanding of gene functions and networks.
RNA-seq; Poisson distribution; EM algorithm; breast cancer cell lines
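The EM procedure for a Poisson mixture can be sketched under simplifying assumptions (independent Poisson counts per condition, a fixed number of groups rather than the paper's model-selection stage); names and defaults below are illustrative, not the authors' implementation:

```python
import numpy as np

def poisson_mixture_em(counts, k, n_iter=100, seed=0):
    """Cluster genes into k groups with a Poisson mixture fitted by EM.

    counts: (n_genes, n_conditions) array of RNA-seq read counts.
    Returns mixture weights pi, per-group mean counts lam, and the
    posterior responsibility of each gene for each group.
    """
    rng = np.random.default_rng(seed)
    n, _ = counts.shape
    # Farthest-point initialization keeps the k starting means apart.
    centers = [counts[rng.integers(n)].astype(float)]
    while len(centers) < k:
        d2 = np.min([((counts - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(counts[d2.argmax()].astype(float))
    lam = np.array(centers) + 1.0
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log-likelihood of each gene under each group's Poissons
        # (the log-factorial term is constant across groups and cancels).
        loglik = (counts[:, None, :] * np.log(lam[None]) - lam[None]).sum(2)
        loglik += np.log(pi)[None]
        loglik -= loglik.max(axis=1, keepdims=True)
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and Poisson means.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / n
        lam = (resp.T @ counts) / nk[:, None] + 1e-9
    return pi, lam, resp
```

On well-separated count profiles (e.g. genes up in one condition versus the other), the responsibilities converge to near-0/1 group assignments after a few dozen iterations.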
With the development of next-generation sequencing (NGS) technologies, a large amount of short-read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and may not be alignable. Sequence-signature methods that compare genomes and metagenomes based on the frequencies of word patterns can therefore be useful for the analysis of short-read data from NGS. Here we review recent developments in alignment-free genome and metagenome comparison based on the frequencies of word patterns, with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related, and the applications of these measures to NGS data.
alignment-free; word patterns; Markov model; genome comparison; statistical power; NGS data
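As a minimal illustration of the word-pattern idea (a generic D2-style measure, not any specific statistic from the review), a dissimilarity can be computed directly from k-mer frequency vectors:

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Count overlapping k-mers (word patterns) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_dissimilarity(seq1, seq2, k=3):
    """Alignment-free dissimilarity: 1 minus the cosine similarity of the
    two k-mer count vectors. 0 for identical word-frequency profiles,
    1 when no k-mer is shared."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    words = set(c1) | set(c2)
    dot = sum(c1[w] * c2[w] for w in words)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return 1.0 - dot / (n1 * n2)
```

The dissimilarity measures surveyed in the review differ mainly in how such raw counts are centred and normalized, e.g. by subtracting expected counts under a background Markov model.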
Among alignment-free methods, Iterated Maps (IMs) occupy a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time-series-analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, ‘Chaos Game Representation’. The clash between the analysis of sequences as continuous series and the better-established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized to non-genomic alphabets, as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of Big Data and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage now appears to be set for a recasting of IMs with a central role in processing next-generation sequencing results.
sequence analysis; iterated maps; chaos game; mapreduce; big data; alignment-free
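The iterated map itself is compact. A sketch of the classic Chaos Game Representation for DNA follows; the corner assignment used here is one common convention among several:

```python
def cgr_coordinates(seq):
    """Chaos Game Representation: starting from the centre of the unit
    square, each successive nucleotide moves the current point halfway
    toward that nucleotide's assigned corner."""
    corners = {'A': (0.0, 0.0), 'C': (0.0, 1.0),
               'G': (1.0, 1.0), 'T': (1.0, 0.0)}
    x, y = 0.5, 0.5
    coords = []
    for base in seq:
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        coords.append((x, y))
    return coords
```

The scale-free property follows from the construction: all sequences sharing a suffix of length k land in the same sub-square of side 2^-k, so k-mer frequencies of every order can be read off one map by counting points in boxes of the corresponding size.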
Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic regulatory mechanisms.
epigenetics; nucleosome; DNA sequence; alignment-free method; machine learning
Epigenetic modifications may play an important role in the formation and progression of complex diseases through the regulation of gene expression. The systematic identification of epigenetic variants that contribute to human diseases can be made possible using genome-wide association studies (GWAS), although epigenetic effects are currently not included in commonly used case–control designs for GWAS. Here, we show that epigenetic modifications can be integrated into a case–control setting by dissolving the overall genetic effect into its different components, additive, dominant and epigenetic. We describe a general procedure for testing and estimating the significance of each component based on a conventional chi-squared test approach. Simulation studies were performed to investigate the power and false-positive rate of this procedure, providing recommendations for its practical use. The integration of epigenetic variants into GWAS can potentially improve our understanding of how genetic, environmental and stochastic factors interact with epialleles to construct the genetic architecture of complex diseases.
Case-control design; epigenetic effect; quantitative genetics
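The conventional chi-squared machinery underlying such tests is straightforward. As a generic sketch, here is a plain Pearson test statistic for a case–control by genotype contingency table; this illustrates the testing framework only, not the authors' additive/dominant/epigenetic decomposition:

```python
def chisq_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table,
    e.g. rows = case/control, columns = genotypes AA, Aa, aa.
    Compares observed counts with those expected under independence."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat
```

Decomposing the overall genetic effect amounts to partitioning this omnibus statistic (with its (r-1)(c-1) degrees of freedom) into single-degree-of-freedom components, one per effect tested.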
We compared the performance of template-free (docking) and template-based methods for the prediction of protein–protein complex structures. We found similar performance for a template-based method based on threading (COTH) and another template-based method based on structural alignment (PRISM). The template-based methods showed similar performance to a docking method (ZDOCK) when the latter was allowed one prediction for each complex, but when the same number of predictions was allowed for each method, the docking approach outperformed template-based approaches. We identified strengths and weaknesses in each method. Template-based approaches were better able to handle complexes that involved conformational changes upon binding. Furthermore, the threading-based and docking methods were better than the structural-alignment-based method for enzyme–inhibitor complex prediction. Finally, we show that the near-native (correct) predictions were generally not shared by the various approaches, suggesting that integrating their results could be the superior strategy.
protein–protein structure; template-based prediction; protein–protein docking; ZDOCK; PRISM; COTH
The formation of phenotypic traits, such as biomass production, tumor volume and viral abundance, undergoes a complex process in which interactions between genes and developmental stimuli take place at each level of biological organization from cells to organisms. Traditional studies emphasize the impact of genes by directly linking DNA-based markers with static phenotypic values. Functional mapping, derived to detect genes that control developmental processes using growth equations, has proven powerful for addressing questions about the roles of genes in development. By treating phenotypic formation as a cohesive system using differential equations, a different approach—systems mapping—dissects the system into interconnected elements and then maps genes that determine a web of interactions among these elements, facilitating our understanding of the genetic machineries for phenotypic development. Here, we argue that genetic mapping can play a more important role in studying the genotype–phenotype relationship by filling the gaps in the biochemical and regulatory process from DNA to end-point phenotype. We describe a new framework, named network mapping, to study the genetic architecture of complex traits by integrating the regulatory networks that cause a high-order phenotype. Network mapping makes use of a system of differential equations to quantify the rule by which transcriptional, proteomic and metabolomic components interact with each other to organize into a functional whole. The synthesis of functional mapping, systems mapping and network mapping provides a novel avenue to decipher a comprehensive picture of the genetic landscape of complex phenotypes that underlie economically and biomedically important traits.
network mapping; complex traits; differential equations; DNA polymorphism; systems biology
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence–structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that can be used to produce informative animations.
protein conformational sampling; parametric models; assessment tools; hidden Markov models; principal component analysis; dihedral and planar angles
A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. To make an objective and reproducible performance assessment, we have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling publicly available SNP microarray data from genomic regions with known copy-number state. The original data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. This article describes this framework and its application to a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify the pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. This comparison study may be reproduced using the open source and cross-platform R package jointseg, which implements the proposed data generation and evaluation framework: http://r-forge.r-project.org/R/?group_id=1562.
DNA copy number; segmentation; realistic data generation; performance evaluation
Advancements in high-throughput nucleotide sequencing techniques have brought with them state-of-the-art bioinformatics programs and software packages. Given the importance of molecular sequence data in contemporary life science research, these software suites are becoming an essential component of many labs and classrooms, and as such are frequently designed for non-computer specialists and marketed as one-stop bioinformatics toolkits. Although beautifully designed and powerful, user-friendly bioinformatics packages can be expensive and, as more arrive on the market each year, it can be difficult for researchers, teachers and students to choose the right software for their needs, especially if they do not have a bioinformatics background. This review highlights some of the currently available and most popular commercial bioinformatics packages, discussing their prices, usability, features and suitability for teaching. Although several commercial bioinformatics programs are arguably overpriced and overhyped, many are well designed, sophisticated and, in my opinion, worth the investment. Whether you are just beginning your foray into molecular sequence analysis or are an experienced genomicist, I encourage you to explore proprietary software bundles. They have the potential to streamline your research, increase your productivity, energize your classroom and, if anything, add a bit of zest to the often dry, detached world of bioinformatics.
bioinformatics software; CLC bio; Geneious; genome assembly; nucleotide alignment; phylogenetics software
Amino acid repeats (AARs) are abundant in protein sequences. They have particular roles in protein function and evolution. Simple repeat patterns generated by DNA slippage tend to introduce length variations and point mutations in repeat regions. Loss of normal and gain of abnormal function owing to their variable length are potential risks leading to diseases. Repeats with complex patterns mostly refer to the functional domain repeats, such as the well-known leucine-rich repeat and WD repeat, which are frequently involved in protein–protein interaction. They are mainly derived from internal gene duplication events and stabilized by ‘gate-keeper’ residues, which play crucial roles in preventing inter-domain aggregation. AARs are widely distributed in different proteomes across a variety of taxonomic ranges, and especially abundant in eukaryotic proteins. However, their specific evolutionary and functional scenarios are still poorly understood. Identifying AARs in protein sequences is the first step for the further investigation of their biological function and evolutionary mechanism. In principle, this is an NP-hard problem, as most of the repeat fragments are shaped by a series of sophisticated evolutionary events and become latent periodic patterns. It is not possible to define a uniform criterion for detecting and verifying various repeat patterns. Instead, different algorithms based on different strategies have been developed to cope with different repeat patterns. In this review, we attempt to describe the amino acid repeat-detection algorithms currently available and compare their strategies based on an in-depth analysis of the biological significance of protein repeats.
amino acid repeat; detection algorithm; low complexity sequence; repeat containing protein; protein domain repeats
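While no uniform criterion covers all repeat patterns, the simplest case, perfect tandem repeats of the kind generated by slippage, can be found with a regular-expression scan. This toy sketch (illustrative parameters only) deliberately ignores the degenerate and latent patterns discussed above:

```python
import re

def find_perfect_repeats(seq, max_unit=5, min_copies=3):
    """Scan a protein sequence for perfect tandem amino acid repeats:
    a unit of 1..max_unit residues repeated at least min_copies times
    in a row. Returns (start, unit, copies) tuples."""
    hits = []
    for u in range(1, max_unit + 1):
        # Group 2 captures the unit; \2 requires min_copies-1 more copies.
        pattern = re.compile(r'((\w{%d})\2{%d,})' % (u, min_copies - 1))
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.group(2), len(m.group(1)) // u))
    return hits
```

Real detectors differ precisely in relaxing this perfection requirement, scoring approximate periodicity instead of exact string identity.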
The discipline of bioinformatics has developed rapidly since the complete sequencing of the first genomes in the 1990s. The development of many high-throughput techniques during the last decades has ensured that bioinformatics has grown into a discipline that overlaps with, and is required for, the modern practice of virtually every field in the life sciences. This has placed a scientific premium on the availability of skilled bioinformaticians, a qualification that is extremely scarce on the African continent. The reasons for this are numerous, although the absence of a skilled bioinformatician at academic institutions to initiate a training process and build sustained capacity seems to be a common African shortcoming. This dearth of bioinformatics expertise has had a knock-on effect on the establishment of many modern high-throughput projects at African institutes, including the comprehensive and systematic analysis of genomes from African populations, which are among the most genetically diverse anywhere on the planet. Recent funding initiatives from the National Institutes of Health and the Wellcome Trust are aimed at ameliorating this shortcoming. In this paper, we discuss the problems that have limited the establishment of the bioinformatics field in Africa, as well as propose specific actions that will help with the education and training of bioinformaticians on the continent. This is an absolute requirement in anticipation of a boom in high-throughput approaches to human health issues unique to data from African populations.
bioinformatics education; bioinformatics in Africa; postgraduate program
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
biclustering; microarray; gene expression; clustering
Glycosylation of proteins is involved in immune defense, cell–cell adhesion, cellular recognition and pathogen binding and is one of the most common and complex post-translational modifications. Science is still struggling to assign detailed mechanisms and functions to this form of conjugation. Even the structural analysis of glycoproteins—glycoproteomics—remains in its infancy due to the scarcity of high-throughput analytical platforms capable of determining glycopeptide composition and structure, especially platforms for complex biological mixtures. Glycopeptide composition and structure can be determined with high mass-accuracy mass spectrometry, particularly when combined with chromatographic separation, but the sheer volume of generated data necessitates computational software for interpretation. This review discusses the current state of glycopeptide assignment software—advances made to date and issues that remain to be addressed. The various software and algorithms developed so far provide important insights into glycoproteomics. However, there is currently no freely available software that can analyze spectral data in batch and unambiguously determine glycopeptide compositions for N- and O-linked glycopeptides from relevant biological sources such as human milk and serum. Few programs are capable of aiding in structural determination of the glycan component. To significantly advance the field of glycoproteomics, analytical software and algorithms are required that: (i) solve for both N- and O-linked glycopeptide compositions, structures and glycosites in biological mixtures; (ii) are high-throughput and process data in batches; (iii) can interpret mass spectral data from a variety of sources and (iv) are open source and freely available.
glycopeptide; glycoproteomics; glycopeptidomics; bioinformatics; N-linked; O-linked
A number of supervised machine learning models have recently been introduced for the prediction of drug–target interactions based on chemical structure and genomic sequence information. Although these models could offer improved means for many network pharmacology applications, such as repositioning of drugs for new therapeutic uses, the prediction models are often constructed and evaluated under overly simplified settings that do not reflect the real-life problem in practical applications. Using quantitative drug–target bioactivity assays for kinase inhibitors, as well as a popular benchmarking data set of binary drug–target interactions for enzyme, ion channel, nuclear receptor and G protein-coupled receptor targets, we illustrate here the effects of four factors that may lead to dramatic differences in the prediction results: (i) problem formulation (standard binary classification or more realistic regression formulation), (ii) evaluation data set (drug and target families in the application use case), (iii) evaluation procedure (simple or nested cross-validation) and (iv) experimental setting (whether training and test sets share common drugs and targets, only drugs or targets or neither). Each of these factors should be taken into consideration to avoid reporting overoptimistic drug–target interaction prediction results. We also suggest guidelines on how to make supervised drug–target interaction prediction studies more realistic, in terms of model formulations and evaluation setups that better address the inherent complexity of the prediction task in practical applications, as well as novel benchmarking data sets that capture the continuous nature of the drug–target interactions for kinase inhibitors.
drug–target interaction; kinase bioactivity assays; nested cross-validation; predictive modeling; supervised machine learning
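Factor (iii) above, simple versus nested cross-validation, is the one most often overlooked. A minimal sketch with ridge regression as a stand-in model (the studies themselves use other learners; all names and defaults here are illustrative) shows the structure: the inner loop selects the hyperparameter, so the outer test fold never influences model selection:

```python
import numpy as np

def nested_cv_ridge(X, y, alphas=(0.1, 1.0, 10.0), n_outer=5, n_inner=4, seed=0):
    """Nested cross-validation for ridge regression. Returns the mean
    outer-fold test MSE; the regularization strength is chosen by an
    inner CV run only on each outer training split."""
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), n_outer)

    def fit(Xtr, ytr, a):
        d = Xtr.shape[1]
        return np.linalg.solve(Xtr.T @ Xtr + a * np.eye(d), Xtr.T @ ytr)

    def mse(w, Xte, yte):
        return float(np.mean((Xte @ w - yte) ** 2))

    outer_errors = []
    for i in range(n_outer):
        test = outer_folds[i]
        train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        inner_folds = np.array_split(train, n_inner)

        def inner_score(a):
            # Inner loop: evaluate alpha using only the outer training data.
            errs = []
            for m in range(n_inner):
                val = inner_folds[m]
                tr = np.concatenate([f for j, f in enumerate(inner_folds) if j != m])
                errs.append(mse(fit(X[tr], y[tr], a), X[val], y[val]))
            return float(np.mean(errs))

        best = min(alphas, key=inner_score)
        outer_errors.append(mse(fit(X[train], y[train], best), X[test], y[test]))
    return float(np.mean(outer_errors))
```

Factor (iv) would additionally require splitting so that drugs and/or targets, rather than individual pairs, are disjoint between the train and test folds.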
In an interesting and quite exhaustive review of Random Forests (RF) methodology in bioinformatics, Touw et al. address—among other topics—the problem of detecting interactions between variables based on RF methodology. We feel that some important statistical concepts, such as ‘interaction’, ‘conditional dependence’ or ‘correlation’, are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of these central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.
random forest; statistics; interaction; correlation; conditional inference trees; conditional variable importance
The rise of personalized medicine and the availability of high-throughput molecular analyses in the context of clinical care have increased the need for adequate tools for translational researchers to manage and explore these data. We reviewed the biomedical literature for translational platforms allowing the management and exploration of clinical and omics data, and identified seven platforms: BRISK, caTRIP, cBio Cancer Portal, G-DOC, iCOD, iDASH and tranSMART. We analyzed these platforms along seven major axes. (1) The community axis gathered information regarding the initiators and funders of each project, as well as its availability status and references. (2) Under the information content axis we grouped the nature of the clinical and omics data handled by each system. (3) The privacy management environment axis encompassed functionalities allowing control over data privacy. (4) In the analysis support axis, we detailed the analytical and statistical tools provided by the platforms. We also explored (5) interoperability support and (6) system requirements. The final axis, (7) platform support, listed the availability of documentation and installation procedures. A large heterogeneity was observed with regard to the capability to manage phenotype information in addition to omics data, as well as to the platforms' security and interoperability features. The analytical and visualization features strongly depend on the platform considered. Similarly, the availability of the systems is variable. This review aims at providing the reader with the background needed to choose the platform best suited to their needs. To conclude, we discuss the desiderata for optimal translational research platforms, in terms of privacy, interoperability and technical features.
translational medical research; biomedical research; clinical data; high-throughput technologies; information storage and retrieval