Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, such as their high sequencing throughput and relatively low cost, the assembly of the reads they produce remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four stages: preprocessing filtering, graph construction, graph simplification, and postprocessing filtering. We discuss these four stages as a framework for data analysis and processing, and survey the variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that current assemblers face in the next-generation environment in order to characterize the state of the art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
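To make the four-stage framework concrete, the toy sketch below wires the stages together in Python. Every function, data structure, and threshold here is illustrative only; real assemblers implement each stage with far more sophisticated algorithms and heuristics.

```python
# Minimal sketch of the four-stage assembler framework described above.
# All names are illustrative; real assemblers differ greatly at each stage.

from collections import defaultdict

def preprocess(reads, min_len=10):
    """Preprocessing filtering: drop reads that are too short or contain Ns."""
    return [r for r in reads if len(r) >= min_len and "N" not in r]

def build_graph(reads, k=5):
    """Graph construction: a toy de Bruijn graph mapping each k-mer
    to the set of k-mers that follow it."""
    graph = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k):
            graph[r[i:i + k]].add(r[i + 1:i + k + 1])
    return graph

def simplify(graph):
    """Graph simplification: collapse unambiguous chains (nodes with a
    single successor) into contigs."""
    contigs, visited = [], set()
    for start in [n for n in graph if len(graph[n]) == 1]:
        if start in visited:
            continue
        contig, node = start, start
        while node in graph and len(graph[node]) == 1 and node not in visited:
            visited.add(node)
            node = next(iter(graph[node]))
            contig += node[-1]
        contigs.append(contig)
    return contigs

def postprocess(contigs, min_len=8):
    """Postprocessing filtering: discard contigs below a length threshold."""
    return [c for c in contigs if len(c) >= min_len]

reads = ["ACGTACGTGG", "CGTACGTGGA", "GTACGTGGAT"]
print(postprocess(simplify(build_graph(preprocess(reads)))))
```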
Small drug molecules usually bind to multiple protein targets, or even to unintended off-targets. Such drug promiscuity has often led to unwanted or unexplained drug reactions, resulting in side effects or in drug repositioning opportunities. Identifying potential drug-target interactions (DTI) is therefore a perennial issue in pharmacology. However, experimental DTI discovery remains challenging because it is expensive in time and resources, and many computational methods have consequently been developed to predict DTI from high-throughput biological and clinical data. Here, we demonstrate for the first time that on-target and off-target effects can be characterized by drug-induced in vitro genomic expression changes, e.g., the data in the Connectivity Map (CMap). Thus, unknown ligands of a given target can be found among the compounds showing high gene-expression similarity to its known ligands. To clarify the practical details of CMap-based DTI prediction, we then objectively evaluate how well each target is characterized by CMap. The results suggest that (1) some targets are better characterized than others, so prediction models specific to these well-characterized targets will be more accurate and reliable; and (2) in some cases, a family of ligands for the same target tends to interact with common off-targets, which may help increase the efficiency of DTI discovery and explain the mechanisms of complicated drug actions. In the present study, CMap expression similarity is proposed as a novel indicator of drug-target interactions. Detailed strategies for improving data quality by reducing batch effects and for building prediction models are also established. We believe this success with CMap can be translated to other public and commercial genomic expression data, increasing research productivity towards valid drug repositioning and minimal side effects.
Small drug molecules usually bind to unintended off-targets, leading to unexpected drug responses such as side effects or drug repositioning opportunities. Identifying unintended drug-target interactions (DTI) is thus particularly important for understanding complicated drug actions. Because experimentally determining DTI remains expensive, various computational methods have been developed. In this study, we demonstrate for the first time that target binding is directly correlated with drug-induced genomic expression profiles in the Connectivity Map (CMap). By improving the data quality of CMap, we illustrate three important facts: (1) drugs binding to common targets show higher gene-expression similarity than random compounds, indicating that upstream ligand binding can be characterized by downstream gene-expression changes; (2) some targets are better characterized by CMap than others, so to ensure efficient DTI discovery, prediction models should be built specifically for those well-characterized targets; (3) it is broadly observed in the predicted DTI that ligands for the same target may collectively interact with common off-targets. This observation is consistent with published experimental evidence and can help explain the mechanisms of unexplained drug reactions. Based on CMap, our work establishes an efficient pipeline for identifying potential DTI. By extending this success with CMap to other genomic data sources, we believe more DTI will be discovered.
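The core similarity-based ranking idea can be illustrated in a few lines. The sketch below is hypothetical: the signatures are random stand-ins for CMap expression profiles, and the Spearman-correlation ranking is only one plausible choice of similarity measure.

```python
# Illustrative sketch: rank candidate compounds by gene-expression similarity
# to known ligands of a target. Data and compound names are hypothetical,
# not the actual CMap pipeline.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes = 500

# Hypothetical drug-induced expression signatures (one vector per compound).
known_ligands = {"ligand_A": rng.normal(size=n_genes)}
# Make one candidate resemble the known ligand, the rest random.
candidates = {
    "compound_1": known_ligands["ligand_A"] + rng.normal(scale=0.5, size=n_genes),
    "compound_2": rng.normal(size=n_genes),
    "compound_3": rng.normal(size=n_genes),
}

def similarity_to_known(profile, known):
    """Mean rank correlation between a candidate and the known-ligand profiles."""
    return np.mean([spearmanr(profile, ref)[0] for ref in known.values()])

ranked = sorted(candidates,
                key=lambda c: similarity_to_known(candidates[c], known_ligands),
                reverse=True)
print(ranked)  # compound_1 should rank first
```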
The field of medical systems biology aims to advance understanding of the molecular mechanisms that drive disease progression and to translate this knowledge into therapies to effectively treat diseases. A challenging task is the investigation of the long-term effects of a (pharmacological) treatment, in order to establish its applicability and to identify potential side effects. We present a new modeling approach, called Analysis of Dynamic Adaptations in Parameter Trajectories (ADAPT), to analyze the long-term effects of a pharmacological intervention. A concept of time-dependent evolution of model parameters is introduced to study the dynamics of molecular adaptations. The progression of these adaptations is predicted by identifying the dynamic changes in the model parameters necessary to describe the transition between experimental data obtained during different stages of the treatment. The trajectories provide insight into the affected underlying biological systems and identify the molecular events that should be studied in more detail to unravel the mechanistic basis of treatment outcome. Modulating effects caused by interactions with the proteome and transcriptome levels, which are often less well understood, can be captured by the time-dependent descriptions of the parameters. ADAPT was employed to identify metabolic adaptations induced upon pharmacological activation of the liver X receptor (LXR), a potential drug target to treat or prevent atherosclerosis. The trajectories were investigated to study the cascade of adaptations. This provided a counter-intuitive insight concerning the function of scavenger receptor class B1 (SR-B1), a receptor that facilitates the hepatic uptake of cholesterol. Although activation of LXR promotes cholesterol efflux and excretion, our computational analysis showed that the hepatic capacity to clear cholesterol was reduced upon prolonged treatment. This prediction was confirmed experimentally by immunoblotting measurements of SR-B1 in hepatic membranes. In addition to identifying potential unwanted side effects, we demonstrate how ADAPT can be used to design new target interventions to prevent them.
A driving ambition of medical systems biology is to advance our understanding of the molecular processes that drive the progression of complex diseases such as type 2 diabetes and cardiovascular disease. This insight is essential to enable the development of therapies that effectively treat diseases. A challenging task is to investigate the long-term effects of a treatment, in order to establish its applicability and to identify potential side effects; there is therefore a growing need for novel approaches to support this research. Here, we present a new computational approach to identify treatment effects. We make use of a computational model of the biological system, which is used to describe the experimental data obtained during different stages of the treatment. To incorporate the long-term, progressive adaptations in the system induced by changes in gene and protein expression, the model is iteratively updated. The approach was employed to identify metabolic adaptations induced by a potential anti-atherosclerotic and anti-diabetic drug target. Our approach identifies the molecular events that should be studied in more detail to establish the mechanistic basis of treatment outcome. New biological insight was obtained concerning the metabolism of cholesterol, which was in turn experimentally validated.
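The following toy sketch conveys the underlying idea of a time-dependent parameter trajectory: a model parameter is refit on successive data windows so that its estimated trajectory tracks a slow adaptation. It uses a simple decay model and windowed regression, not the actual ADAPT implementation.

```python
# Toy sketch of a time-dependent parameter trajectory: refit a parameter on
# successive time windows so its trajectory tracks slow adaptations.
# Model: x' = -k(t) * x, with k(t) drifting during "treatment".

import numpy as np

rng = np.random.default_rng(1)
dt, n_steps = 0.1, 400
t = np.arange(n_steps) * dt
k_true = 0.5 + 0.3 * t / t[-1]          # parameter drifts from 0.5 to 0.8

# Simulate noisy data with the drifting parameter.
x = np.empty(n_steps); x[0] = 10.0
for i in range(1, n_steps):
    x[i] = x[i - 1] * (1 - k_true[i] * dt)
data = x * (1 + 0.02 * rng.normal(size=n_steps))

# Estimate k on sliding windows: for x' = -k x, k ~ -d(ln x)/dt.
window = 40
k_hat = []
for start in range(0, n_steps - window, window):
    seg = data[start:start + window]
    slope = np.polyfit(t[start:start + window], np.log(seg), 1)[0]
    k_hat.append(-slope)

print(np.round(k_hat, 3))  # increasing trajectory, roughly 0.5 -> 0.8
```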
A molecular device that records time-varying signals would enable new approaches in neuroscience. We have recently proposed such a device, termed a “molecular ticker tape”, in which an engineered DNA polymerase (DNAP) writes time-varying signals into DNA in the form of nucleotide misincorporation patterns. Here, we define a theoretical framework quantifying the expected capabilities of molecular ticker tapes as a function of experimental parameters. We present a decoding algorithm for estimating time-dependent input signals, and DNAP kinetic parameters, directly from misincorporation rates as determined by sequencing. We explore the requirements for accurate signal decoding, particularly the constraints on (1) the polymerase biochemical parameters, and (2) the amplitude, temporal resolution, and duration of the time-varying input signals. Our results suggest that molecular recording devices with kinetic properties similar to natural polymerases could be used to perform experiments in which neural activity is compared across several experimental conditions, and that devices engineered by combining favorable biochemical properties from multiple known polymerases could potentially measure faster phenomena such as slow synchronization of neuronal oscillations. Sophisticated engineering of DNAPs is likely required to achieve molecular recording of neuronal activity with single-spike temporal resolution over experimentally relevant timescales.
Recording of physiological signals from inaccessible microenvironments is often hampered by the macroscopic sizes of current recording devices. A signal-recording device constructed on a molecular scale could advance biology by enabling the simultaneous recording from millions or billions of cells. We recently proposed a molecular device for recording time-varying ion concentration signals: DNA polymerases (DNAPs) copy known template DNA strands with an error rate dependent on the local ion concentration. The resulting DNA polymers could then be sequenced, and with the help of statistical techniques, used to estimate the time-varying ion concentration signal experienced by the polymerase. We develop a statistical framework to treat this inverse problem and describe a technique to decode the ion concentration signals from DNA sequencing data. We also provide a novel method for estimating properties of DNAP dynamics, such as polymerization rate and pause frequency, directly from sequencing data. We use this framework to explore potential application scenarios for molecular recording devices, achievable via molecular engineering within the biochemical parameter ranges of known polymerases. We find that accurate recording of neural firing rate responses across several experimental conditions would likely be feasible using molecular recording devices with kinetic properties similar to those of known polymerases.
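A minimal sketch of the decoding step follows, under the simplifying (and hypothetical) assumption that the misincorporation probability rises linearly with ion concentration; the actual framework instead models DNAP kinetics explicitly.

```python
# Toy sketch of decoding a time-varying signal from misincorporation counts.
# Hypothetical assumption: misincorporation probability = p0 + a * concentration.

import numpy as np

rng = np.random.default_rng(2)
p0, a = 0.01, 0.08          # baseline error rate and concentration sensitivity
conc = np.array([0.2, 0.8, 0.5, 1.0, 0.1])   # hidden signal per time bin
n_reads = 5000              # sequenced molecules covering each bin

# Each read records a misincorporation in a bin with probability p0 + a*conc.
counts = rng.binomial(n_reads, p0 + a * conc)

# Maximum-likelihood inversion of the binomial rate back to concentration.
rate_hat = counts / n_reads
conc_hat = np.clip((rate_hat - p0) / a, 0, None)
print(np.round(conc_hat, 2))   # should approximate the hidden signal
```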
We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of mammalian genomes and transcriptomes, in particular using RNA-seq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs as there are protein-coding genes. Moreover, about two thirds of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
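The "mark and recapture" logic behind such maximum-likelihood estimates can be illustrated with the classic Chapman-corrected Lincoln-Petersen estimator: two independently sampled catalogs of the same lincRNome share some genes, and the size of that overlap constrains the total. The numbers below are invented for illustration and are not the study's data.

```python
# Capture-recapture intuition behind estimating total lincRNome size:
# two independent catalogs of sizes n1 and n2 share m genes.

def chapman_estimate(n1, n2, m):
    """Chapman's nearly unbiased variant of the Lincoln-Petersen estimator."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical example: catalogs of 6,000 and 8,000 lincRNAs sharing 1,000.
print(round(chapman_estimate(6000, 8000, 1000)))  # ~48,000 lincRNA genes
```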
The present work exemplifies how parameter identifiability analysis can be used to gain insight into differences between experimental systems and how uncertainty in parameter estimates can be handled. The case study presented here investigates interferon-gamma (IFNγ)-induced STAT1 signalling in two cell types that play a key role in pancreatic cancer development: pancreatic stellate cells and cancer cells. IFNγ inhibits the growth of both cell types and may be prototypic of agents that simultaneously hit cancer and stroma cells. We combined time-course experiments with mathematical modelling to focus on the common situation in which variations between profiles of experimental time series from different cell types are observed. To understand which biochemical reactions cause the observed variations, we performed a parameter identifiability analysis. By comparing confidence intervals of parameter value estimates and the variability of model trajectories, we successfully identified reactions that differ between pancreatic stellate cells and cancer cells. Our analysis shows that useful information can also be obtained from nonidentifiable parameters. For the prediction of potential therapeutic targets we studied the consequences of uncertainty in the values of identifiable and nonidentifiable parameters. Interestingly, the sensitivity of model variables is robust against parameter variations and against differences between IFNγ-induced STAT1 signalling in pancreatic stellate and cancer cells. This provides the basis for a prediction of therapeutic targets that are valid for both cell types.
For the prediction of therapeutic targets and the design of therapies, it is important to study the same pathway across different cell types. This is particularly relevant for cancer research, where several cell types are involved in carcinogenesis. Pancreatic cancer growth is promoted by activated pancreatic stellate cells. It would thus seem plausible for an effective therapy to hit both stellate and cancer cells. The cytokine IFNγ is an inhibitor of proliferation in both cell types, and its antiproliferative effects are mediated by STAT1 signalling. An important task is to determine the reactions that cause the differences between the two cell types in the initial increase of phosphorylated STAT1 and in the temporal profile of STAT1 nuclear accumulation. We examined this question by performing a parameter identifiability analysis for calibrated mathematical models. We calculated confidence intervals of the estimated parameter values and found that they provide insight into the reactions underlying the differences. A key finding of the sensitivity analysis was that predicted targets for enhancement of STAT1 activity are robust against parameter uncertainty and, moreover, are robust between the two cell types. Our case study therefore exemplifies how identifiability and sensitivity analysis can provide a basis for the prediction of potential therapeutic targets.
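The sketch below illustrates a profile-likelihood identifiability analysis on a toy exponential-decay model: one parameter is profiled out analytically and a likelihood-ratio threshold defines the confidence interval. This mirrors the general approach of comparing parameter confidence intervals, not the authors' STAT1 model.

```python
# Profile-likelihood confidence interval for a toy model y = A * exp(-k t).

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
t = np.linspace(0, 5, 25)
y = 2.0 * np.exp(-0.7 * t) + 0.05 * rng.normal(size=t.size)

def rss_given_k(k):
    """Profile out the amplitude A analytically (linear least squares)."""
    basis = np.exp(-k * t)
    A = (basis @ y) / (basis @ basis)
    return np.sum((y - A * basis) ** 2)

k_grid = np.linspace(0.2, 1.5, 200)
rss = np.array([rss_given_k(k) for k in k_grid])

# Likelihood-ratio threshold for an approximate 95% CI on one parameter
# (Gaussian errors with unknown variance, concentrated likelihood).
n = t.size
threshold = rss.min() * np.exp(chi2.ppf(0.95, df=1) / n)
inside = k_grid[rss <= threshold]
print(f"k_hat={k_grid[rss.argmin()]:.2f}, "
      f"95% CI=({inside.min():.2f}, {inside.max():.2f})")
```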
Next generation sequencing (NGS) has enabled high throughput discovery of somatic mutations. Detection depends on experimental design, lab platforms, parameters and analysis algorithms, and NGS-based somatic mutation detection is consequently prone to erroneous calls, with reported validation rates near 54% and congruence between algorithms of less than 50%. Here, we developed an algorithm to assign a single statistic, a false discovery rate (FDR), to each somatic mutation identified by NGS. This FDR confidence value accurately discriminates true mutations from erroneous calls. Using sequencing data generated from triplicate exome profiling of C57BL/6 mice and B16-F10 melanoma cells, we applied the existing algorithms GATK, SAMtools and SomaticSNiPer to identify somatic mutations. For each identified mutation, our algorithm assigned an FDR. We selected 139 mutations for validation, including 50 somatic mutations assigned a low FDR (high confidence) and 44 mutations assigned a high FDR (low confidence). All of the high confidence somatic mutations validated (50 of 50), none of the 44 low confidence somatic mutations validated, and 15 of the 45 mutations with an intermediate FDR validated. Furthermore, the assignment of a single FDR to individual mutations enables statistical comparisons of lab and computational methodologies, including ROC curves and AUC metrics. On the HiSeq 2000, single-end 50 nt reads from replicates generated the highest confidence somatic mutation call set.
Next generation sequencing (NGS) has enabled unbiased, high throughput discovery of genetic variations and somatic mutations. However, the NGS platform is still prone to errors that result in inaccurate mutation calls. A statistical measure of the confidence of putative mutation calls would enable researchers to prioritize and select mutations in a robust manner. Here we present our development of a confidence score for mutation calls and apply the method to the identification of somatic mutations in B16 melanoma. We use NGS exome resequencing to profile triplicates of both the reference C57BL/6 mice and the B16-F10 melanoma cells. These replicate data allow us to formulate the false discovery rate of somatic mutations as a statistical quantity. Using this method, we show that 50 of 50 high confidence mutation calls are correct while 0 of 44 low confidence mutations are correct, demonstrating that the method correctly ranks mutation calls.
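The replicate-based FDR idea reduces to a simple ratio: calls produced by comparing reference replicates are false by construction, so their rate at a given score threshold estimates the false fraction of tumour-normal calls at the same threshold. A sketch with synthetic scores:

```python
# Replicate-based empirical FDR: normal-vs-normal calls form a null
# distribution of false positives. Scores here are synthetic.

import numpy as np

rng = np.random.default_rng(4)
# Variant-call scores from normal-vs-normal replicates (all false positives).
null_scores = rng.normal(2.0, 1.0, size=300)
# Tumour-vs-normal scores: a mixture of true somatic calls and noise.
tumor_scores = np.concatenate([rng.normal(6.0, 1.0, 150),
                               rng.normal(2.0, 1.0, 150)])

def empirical_fdr(s, null, obs):
    """Estimated FDR for calls with score >= s."""
    n_null = np.mean(null >= s)        # false-call rate at this threshold
    n_obs = np.mean(obs >= s)          # observed-call rate at this threshold
    return min(1.0, n_null / n_obs) if n_obs > 0 else 1.0

for s in (1.0, 3.0, 5.0):
    print(f"score>={s}: FDR~{empirical_fdr(s, null_scores, tumor_scores):.2f}")
```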
The serotonin 2C receptor (5-HT2CR), a key regulator of diverse neurological processes, exhibits functional variability derived from editing of its pre-mRNA by site-specific adenosine deamination (A-to-I pre-mRNA editing) at five distinct sites. Here we describe a statistical technique developed for analysis of the dependencies among the editing states of the five sites. The statistical significance of the observed correlations was estimated by comparing editing patterns in multiple individuals. For both human and rat 5-HT2CR, the editing states of the physically proximal sites A and B were found to be strongly dependent. In contrast, the editing states of sites C and D, which are also physically close, appear not to be directly dependent but instead are linked through their dependencies on sites A and B, respectively. We observed pronounced differences between the editing patterns in humans and rats: in humans site A is the key determinant of the editing state of the other sites, whereas in rats this role belongs to site B. The structure of the dependencies among the editing sites is notably simpler in rats than in humans, implying more complex regulation of 5-HT2CR editing and, by inference, function in the human brain. Thus, exhaustive statistical analysis of the 5-HT2CR editing patterns indicates that the editing state of sites A and B is the primary determinant of the editing states of the other three sites, and hence of the overall editing pattern. Taken together, these findings allow us to propose a mechanistic model of concerted action of ADAR1 and ADAR2 in 5-HT2CR editing. The statistical approach developed here can be applied to other cases of interdependencies among modification sites in RNA and proteins.
The serotonin receptor 2C is a key regulator of diverse neurological processes that affect feeding behavior, sleep, sexual behavior, anxiety and depression. The function of the receptor itself is regulated via so-called pre-mRNA editing, i.e. site-specific adenosine deamination at five distinct sites. The greater the number of edited sites in the serotonin receptor mRNA, the lower the activity of the receptor it encodes. Here we used the results of extensive massively parallel sequencing from human and rat brains to elucidate the dependencies among the editing states of the five sites. Despite the apparent simplicity of the problem, disambiguation of these dependencies is a difficult task that required the development of a new statistical technique. We employed this method to analyse the dependencies among the editing states of the five susceptible sites of the receptor mRNA and found that the proximal, juxtaposed sites A and B are strongly interdependent, and that the editing state of these two sites is a major determinant of the editing states of the other three sites, and hence of the overall editing pattern. The statistical approach we developed for the analysis of mRNA editing can be applied to other cases of multiple site modification in RNA and proteins.
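A permutation test of the kind that underlies such dependency analyses can be sketched as follows; the read data are simulated and the co-editing statistic is a deliberately simple stand-in.

```python
# Permutation test for dependence between the editing states of two sites:
# shuffle one site's states across reads and compare the observed co-editing
# statistic to the shuffled null. Reads are simulated.

import numpy as np

rng = np.random.default_rng(5)
n_reads = 2000
site_a = rng.random(n_reads) < 0.6                  # edited with prob 0.6
# Site B depends on A: mostly edited when A is edited.
site_b = np.where(site_a,
                  rng.random(n_reads) < 0.8,
                  rng.random(n_reads) < 0.2)

def co_editing_stat(a, b):
    """Absolute gap between joint and product-of-marginals editing rates."""
    return abs(np.mean(a & b) - np.mean(a) * np.mean(b))

observed = co_editing_stat(site_a, site_b)
null = np.array([co_editing_stat(site_a, rng.permutation(site_b))
                 for _ in range(1000)])
p_value = (1 + np.sum(null >= observed)) / (1 + null.size)
print(f"stat={observed:.3f}, permutation p={p_value:.3f}")
```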
The International Society for Computational Biology, ISCB, organizes the largest event in the field of computational biology and bioinformatics, namely the annual international conference on Intelligent Systems for Molecular Biology, the ISMB. This year at ISMB 2012 in Long Beach, ISCB celebrated the 20th anniversary of its flagship meeting. ISCB is a young, lean and efficient society that aspires to make a significant impact with only limited resources. Many constraints make the choice of venues for ISMB a tough challenge. Here, we describe those challenges and invite the contribution of ideas for solutions.
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or on quality scores (e.g., Phred). Here, DRISEE is applied to non-amplicon data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and use it to uncover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
Sequence quality (referred to alternatively as the level of sequencing error or noise) is a primary concern in all sequence-dependent investigations. This is particularly true in the field of metagenomics, where automated tools (e.g. annotation pipelines like MG-RAST) rely on high-fidelity sequence data to derive meaningful biological inferences; the problem is exacerbated as the capacity of next generation sequencing platforms continues to expand at a rate greater than Moore's law. We demonstrate that the most commonly utilized means of assessing sequencing error exhibit severe limitations with respect to the analysis of metagenomic data. Furthermore, we introduce a method (DRISEE) that accounts for these limitations through a novel approach to assessing sequencing error. DRISEE-based analyses reveal previously unobserved levels of sequencing error. DRISEE provides a platform-independent measure of sequencing error that objectively assesses the quality of entire sequence samples. This assessment can be used to exclude low quality samples from computationally expensive analyses (e.g. annotation). It can also be used to evaluate the relative fidelity of analyses after they have been performed (e.g. annotation of error-prone samples is less reliable than that of samples with low levels of sequencing error).
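The artifactual-duplicate-read idea can be reduced to a toy form: reads sharing an identical prefix are treated as copies of one template, and their disagreement with a bin consensus beyond the prefix estimates the error rate. The sketch below simulates reads and is a simplification of DRISEE, not its implementation.

```python
# Toy duplicate-read error estimate: bin reads by identical prefix, build a
# per-bin consensus, and count disagreements downstream of the prefix.

from collections import Counter, defaultdict
import random

random.seed(6)
template = "".join(random.choice("ACGT") for _ in range(60))
prefix_len, error_rate = 20, 0.01

def sequence(template):
    """Simulate a read with uniform substitution errors."""
    return "".join(c if random.random() > error_rate
                   else random.choice("ACGT".replace(c, "")) for c in template)

reads = [sequence(template) for _ in range(200)]

# Bin reads by identical prefix (artifactual duplicates share a template).
bins = defaultdict(list)
for r in reads:
    bins[r[:prefix_len]].append(r)

# Per-position error: disagreement with the bin consensus beyond the prefix.
errors, total = Counter(), Counter()
for members in bins.values():
    if len(members) < 2:
        continue
    consensus = "".join(Counter(col).most_common(1)[0][0]
                        for col in zip(*members))
    for r in members:
        for i in range(prefix_len, len(r)):
            total[i] += 1
            errors[i] += r[i] != consensus[i]

rates = [errors[i] / total[i] for i in sorted(total)]
print(f"mean estimated error: {sum(rates) / len(rates):.3f}")  # ~0.01
```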
Meiosis is the cell division that halves the genetic component of diploid cells to form gametes or spores. To achieve this, meiotic cells undergo a radical spatial reorganisation of chromosomes. This reorganisation is a prerequisite for the pairing of parental homologous chromosomes and for the reductional division, which halves the number of chromosomes in daughter cells. Of particular note is the change from a centromere-clustered layout (Rabl configuration) to a telomere-clustered conformation (bouquet stage). The contribution of the bouquet structure to homologous chromosome pairing is uncertain. We have developed a new in silico model to represent the chromosomes of Saccharomyces cerevisiae in space, based on a worm-like chain model constrained by attachment to the nuclear envelope and by clustering forces. We have asked how these constraints could influence chromosome layout, with particular regard to the juxtaposition of homologous chromosomes and potential nonallelic, ectopic, interactions. The data support the view that the bouquet may be sufficient to bring short chromosomes together, but that its contribution is smaller for long chromosomes. We also find that persistence length is critical to how much influence the bouquet structure could have, both on the pairing of homologues and on the avoidance of contacts with heterologues. This work represents an important development in computer modeling of chromosomes, and suggests new explanations for why elucidating the functional significance of the bouquet by genetics has been so difficult.
Organisms store their genetic material in the form of chromosomes that must be replicated and shared out during cell division. In sexual reproduction, the cell division called meiosis halves the number of chromosomes to form gametes. This halving requires a complex reorganisation of chromosomes. Each gamete receives one maternal or one paternal copy of every chromosome, which requires a pairing process between the maternal and paternal chromosomes of each type. Once paired, the two chromosomes are organised in space to bias their subsequent movement in opposite directions when the nucleus divides. How chromosomes pair is of great importance for understanding fertility, and for manipulating chromosomes in crop species, where it is desirable to breed in new genes to improve hardiness or yield. We have modelled chromosomes in three dimensions based on the experimental organism Saccharomyces cerevisiae. We used our model to ask whether various physical features of chromosomes might influence their ability to pair. We found that binding chromosome ends to the nuclear wall and pushing those ends together helps to encourage pairing along the length of chromosomes. It has long been known that this special chromosome organisation occurs in live cells, but its significance has been difficult to determine.
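A minimal worm-like chain can be generated with a few lines of code; the sketch below shows how persistence length, the property the study found critical, controls chain stiffness. Nuclear-envelope attachment and clustering forces are omitted, and the Gaussian-kick scheme is only an approximation of a true worm-like chain.

```python
# Approximate worm-like chain in 3D: successive segment directions are
# correlated over a persistence length.

import numpy as np

rng = np.random.default_rng(7)

def worm_like_chain(n_segments=200, seg_len=1.0, persistence=20.0):
    """Generate chain coordinates; bending variance ~ seg_len/persistence."""
    pos = np.zeros((n_segments + 1, 3))
    direction = np.array([1.0, 0.0, 0.0])
    for i in range(n_segments):
        # Perturb the direction; smaller perturbation = stiffer chain.
        kick = rng.normal(scale=np.sqrt(seg_len / persistence), size=3)
        direction = direction + kick
        direction /= np.linalg.norm(direction)
        pos[i + 1] = pos[i] + seg_len * direction
    return pos

for lp in (2.0, 20.0, 200.0):
    end_to_end = np.linalg.norm(worm_like_chain(persistence=lp)[-1])
    print(f"persistence={lp:5.0f}: end-to-end distance={end_to_end:6.1f}")
```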
Different data types can offer complementary perspectives on the same biological phenomenon. In cancer studies, for example, data on copy number alterations indicate losses and amplifications of genomic regions in tumours, while transcriptomic data point to the impact of genomic and environmental events on the internal wiring of the cell. Fusing different data types provides a more comprehensive model of the cancer cell than that offered by any single type. However, biological signals in different patients exhibit diverse degrees of concordance, due to cancer heterogeneity and inherent noise in the measurements. This is a particularly important issue in cancer subtype discovery, where personalised strategies to guide therapy are of vital importance. We present a nonparametric Bayesian model for discovering prognostic cancer subtypes by integrating gene expression and copy number variation data. Our model is constructed from a hierarchy of Dirichlet processes and addresses three key challenges in data fusion: (i) to separate concordant from discordant signals, (ii) to select informative features, and (iii) to estimate the number of disease subtypes. Concordance of signals is assessed individually for each patient, giving us an additional level of insight into the underlying disease structure. We exemplify the power of our model on prostate cancer and breast cancer data and show that it outperforms competing methods. In the prostate cancer data, we identify an entirely new subtype with extremely poor survival outcome and show how other analyses fail to detect it. In the breast cancer data, we find subtypes with superior prognostic value by using the concordant results. These discoveries were crucially dependent on our model's ability to distinguish concordant and discordant signals within each patient sample, and would otherwise have been missed. We therefore demonstrate the importance of taking a patient-specific approach, using highly flexible nonparametric Bayesian methods.
The goal of personalised medicine is to develop accurate diagnostic tests that identify patients who can benefit from targeted therapies. To achieve this goal it is necessary to stratify cancer patients into homogeneous subtypes according to the molecular aberrations their tumours exhibit. Prominent approaches to subtype definition combine information from different molecular levels, for example data on DNA copy number changes with data on mRNA expression changes. This is called data fusion. We contribute to this field by proposing a unified model that fuses different data types, finds informative features and estimates the number of subtypes in the data. The main strength of our model comes from the fact that we assess for each patient whether the different data types agree on a subtype or not. Competing methods combine the data without checking for concordance of signals. On a breast cancer and a prostate cancer data set we show that concordance of signals has a strong influence on subtype definition and that our model allows us to define prognostic subtypes that would otherwise have been missed.
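The nonparametric building block of such models can be illustrated by sampling partitions from a Chinese restaurant process, which shows how the number of subtypes is inferred rather than fixed in advance. This sketch shows only that building block, not the hierarchical, two-data-type model of the paper.

```python
# Chinese restaurant process: patients join clusters with probability
# proportional to cluster size, or open a new cluster with probability
# proportional to the concentration parameter alpha.

import numpy as np

rng = np.random.default_rng(8)

def chinese_restaurant_process(n_patients, alpha=1.0):
    assignments = [0]
    for _ in range(1, n_patients):
        counts = np.bincount(assignments)
        probs = np.append(counts, alpha) / (len(assignments) + alpha)
        assignments.append(rng.choice(len(probs), p=probs))
    return assignments

for alpha in (0.5, 2.0, 10.0):
    z = chinese_restaurant_process(200, alpha)
    print(f"alpha={alpha:4}: {len(set(z))} clusters discovered")
```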
Gene fusions created by somatic genomic rearrangements are known to play an important role in the onset and development of some cancers, such as lymphomas and sarcomas. RNA-Seq (whole transcriptome shotgun sequencing) is proving to be a useful tool for the discovery of novel gene fusions in cancer transcriptomes. However, algorithmic methods for the discovery of gene fusions using RNA-Seq data remain underdeveloped. We have developed deFuse, a novel computational method for fusion discovery in tumor RNA-Seq data. Unlike existing methods that use only unique best-hit alignments and consider only fusion boundaries at the ends of known exons, deFuse considers all alignments and all possible locations for fusion boundaries. As a result, deFuse is able to identify fusion sequences with demonstrably better sensitivity than previous approaches. To increase the specificity of our approach, we curated a list of 60 true positive and 61 true negative fusion sequences (as confirmed by RT-PCR) and trained an AdaBoost classifier on 11 novel features of the sequence data. The resulting classifier has an estimated area under the ROC curve of 0.91. We have used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We report herein the first gene fusions discovered in ovarian cancer. We conclude that gene fusions are not infrequent events in ovarian cancer and that these events have the potential to substantially alter the expression patterns of the genes involved; gene fusions should therefore be considered in efforts to comprehensively characterize the mutational profiles of ovarian cancer transcriptomes.
Genome rearrangements and associated gene fusions are known to be important oncogenic events in some cancers. We have developed a novel computational method called deFuse for detecting gene fusions in RNA-Seq data and have applied it to the discovery of novel gene fusions in sarcoma and ovarian tumors. We assessed the accuracy of our method and found that deFuse achieves substantially better sensitivity and specificity than two other published methods. We have also developed a set of 60 positive and 61 negative examples that will be useful for accurate identification of gene fusions in future RNA-Seq datasets. We have trained a classifier on 11 novel features of the 121 examples, and show that the classifier is able to accurately identify real gene fusions. The 45 gene fusions reported in this study represent the first ovarian cancer fusions reported, as well as novel sarcoma fusions. By examining the expression patterns of the affected genes, we find that many fusions are predicted to have functional consequences and thus merit experimental follow-up to determine their clinical relevance.
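The classification step can be sketched with a standard AdaBoost implementation; the feature matrix below is synthetic, standing in for the 11 sequence-derived features computed over the 121 curated examples.

```python
# AdaBoost classification of candidate fusions with cross-validated ROC AUC.
# Features are synthetic stand-ins for the 11 sequence-derived features.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(9)
n_pos, n_neg, n_features = 60, 61, 11

# Hypothetical feature matrix: true fusions (label 1) shifted from artifacts.
X = np.vstack([rng.normal(0.6, 1.0, (n_pos, n_features)),
               rng.normal(0.0, 1.0, (n_neg, n_features))])
y = np.array([1] * n_pos + [0] * n_neg)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
print(f"cross-validated AUC: {roc_auc_score(y, scores):.2f}")
```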
Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate total copy number and do not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent-specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from current high density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent-specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy-neutral loss of heterozygosity, and the characterization of regions in which both inherited chromosomes change simultaneously.
Many genetic diseases are related to copy number aberrations in some regions of the genome. Each chromosome is normally present in two copies; under some circumstances, however, one or both copies change in some regions. Genotyping microarray data provide the copy numbers of the two alleles at polymorphic sites along the chromosomes, which makes inference of chromosomal copy number aberrations feasible. One difficulty is that genotyping microarray data cannot provide the haplotype of the two copies of a chromosome. In this paper, we model the copy number along the chromosome as a two-dimensional Markov chain. Using the observed copy numbers of both alleles at all sites, we can determine the parent-specific copy number along the chromosome and infer the haplotypes of the two copies of the inherited chromosomes in regions of allelic imbalance. Simulation results show high sensitivity and specificity of the method. Applying this method to glioblastoma samples from the Cancer Genome Atlas illustrates the insights gained from allele-specific copy number analysis.
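A toy version of the segmentation idea: a small hidden Markov model over parent-specific copy-number states, with Gaussian emissions on the total log-ratio and the B-allele frequency, decoded by the Viterbi algorithm. This is a simple stand-in for the stochastic segmentation model of the paper.

```python
# Toy Viterbi segmentation over parent-specific copy-number states.

import numpy as np
from scipy.stats import norm

# (maternal, paternal) copy-number states and their expected observations.
states = [(1, 1), (2, 1), (1, 0), (2, 0)]

def expected(m, p):
    total = m + p
    return np.log2(total / 2), m / total     # (logR, BAF of one parent)

def viterbi(obs, stay=0.999):
    n, k = len(obs), len(states)
    log_trans = np.full((k, k), np.log((1 - stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    for i, (lr, baf) in enumerate(obs):
        emit = [norm.logpdf(lr, expected(*s)[0], 0.2) +
                norm.logpdf(baf, expected(*s)[1], 0.05) for s in states]
        if i == 0:
            score[0] = emit
            continue
        cand = score[i - 1][:, None] + log_trans
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + emit
    path = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return [states[s] for s in reversed(path)]

# Simulated probes: normal (1,1) followed by a one-copy gain (2,1).
rng = np.random.default_rng(10)
truth = [(1, 1)] * 20 + [(2, 1)] * 20
obs = [(rng.normal(expected(*s)[0], 0.2), rng.normal(expected(*s)[1], 0.05))
       for s in truth]
print(viterbi(obs)[18:22])   # should switch from (1,1) to (2,1) near probe 20
```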
Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population, lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and the development of next generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures independently of the data distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically, and introduce a framework for the in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities.
One of the promises of personalized medicine is to use molecular information to better diagnose, manage, and treat disease. This promise is enabled through the use of molecular signatures, computational models that predict a phenotype of interest from high-throughput assay data. Many molecular signatures have been developed to date, and some have passed regulatory approval and are currently used in clinical practice. However, researchers have noted that it is possible to develop many different and equivalently accurate molecular signatures for the same phenotype and population. This phenomenon of signature multiplicity has far-reaching implications for biological discovery and the development of next generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. Our results shed light on the causes of signature multiplicity and provide a method for extracting all equivalently accurate signatures from high-throughput data.
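Signature multiplicity is easy to reproduce on synthetic data: when two features carry the same information, distinct feature subsets are equally predictive, so no single "best" signature exists. A sketch:

```python
# Demonstration of signature multiplicity with a redundant feature pair.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
n = 400
informative = rng.normal(size=n)
duplicate = informative + rng.normal(scale=0.05, size=n)   # redundant copy
noise = rng.normal(size=(n, 3))
y = (informative + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([informative, duplicate, noise])

for name, cols in [("signature {feature 0}", [0]),
                   ("signature {feature 1}", [1]),
                   ("noise-only signature", [2, 3, 4])]:
    acc = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
    print(f"{name}: accuracy={acc:.2f}")   # the first two tie, noise ~0.5
```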
Meaningful exchange of microarray data is currently difficult because it is rare that published data provide sufficient information depth or are even in the same format from one publication to another. Only when data can be easily exchanged will the entire biological community be able to derive the full benefit from such microarray studies.
To this end we have developed three key ingredients towards standardizing the storage and exchange of microarray data. First, we have created a conceptualization of microarray experiments that is compliant with the minimum information about a microarray experiment (MIAME) guidelines, modeled using the unified modeling language (UML) and named MAGE-OM (microarray gene expression object model). Second, we have translated MAGE-OM into an XML-based data format, MAGE-ML, to facilitate the exchange of data. Third, some of us are now using MAGE (or its progenitors) in data production settings. Finally, we have developed a freely available software tool kit (MAGE-STK) that eases the integration of MAGE-ML into end users' systems.
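The object-model-to-markup idea can be sketched generically: describe an experiment as objects, then serialize them to XML. The class and element names below are hypothetical placeholders, not actual MAGE-OM classes or MAGE-ML tags.

```python
# Generic sketch of serializing an object model to an XML exchange format.
# All class and tag names are hypothetical placeholders.

import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class BioAssayDescription:          # hypothetical stand-in for an OM class
    name: str
    array_design: str
    protocol: str

def to_xml(assay: BioAssayDescription) -> str:
    root = ET.Element("ExperimentPackage")      # placeholder tag names
    node = ET.SubElement(root, "BioAssay", name=assay.name)
    ET.SubElement(node, "ArrayDesign").text = assay.array_design
    ET.SubElement(node, "Protocol").text = assay.protocol
    return ET.tostring(root, encoding="unicode")

print(to_xml(BioAssayDescription("hybridization_1", "chipA-v2",
                                 "standard_labeling")))
```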
MAGE will help microarray data producers and users to exchange information by providing a common platform for data exchange, and MAGE-STK will make the adoption of MAGE easier.