Related Articles

1.  A Novel Mechanism Inducing Genome Instability in Kaposi's Sarcoma-Associated Herpesvirus Infected Cells 
PLoS Pathogens  2014;10(5):e1004098.
Kaposi's sarcoma-associated herpesvirus (KSHV) is an oncogenic herpesvirus associated with multiple AIDS-related malignancies. Like other herpesviruses, KSHV has a biphasic life cycle and both the lytic and latent phases are required for tumorigenesis. Evidence suggests that KSHV lytic replication can cause genome instability in KSHV-infected cells, although no mechanism has thus far been described. A surprising link has recently been suggested between mRNA export, genome instability and cancer development. Notably, aberrations in the cellular transcription and export complex (hTREX) proteins have been identified in high-grade tumours and these defects contribute to genome instability. We have previously shown that the lytically expressed KSHV ORF57 protein interacts with the complete hTREX complex; therefore, we investigated the possible link between ORF57, hTREX and KSHV-induced genome instability. Herein, we show that lytically active KSHV-infected cells induce a DNA damage response and, importantly, we demonstrate directly that this is due to DNA strand breaks. Furthermore, we show that sequestration of the hTREX complex by the KSHV ORF57 protein leads to this double strand break response and significant DNA damage. Moreover, we describe a novel mechanism showing that the genetic instability observed is a consequence of R-loop formation. Importantly, the link between hTREX sequestration and DNA damage may be a common feature in herpesvirus infection, as a similar phenotype was observed with the herpes simplex virus 1 (HSV-1) ICP27 protein. Our data provide a model of R-loop induced DNA damage in KSHV-infected cells and describe a novel system for studying genome instability caused by aberrant hTREX.
Author Summary
The hallmarks of cancer comprise the essential elements that permit the formation and development of human tumours. Genome instability is an enabling characteristic that allows the progression of tumorigenesis through genetic mutation and therefore, understanding the molecular causes of genome instability in all cancers is essential for development of therapeutics. The Kaposi's sarcoma-associated herpesvirus (KSHV) is an important human pathogen that causes multiple AIDS-related cancers. Recent studies have shown that during KSHV infection, cells show an increase in a double-strand DNA break marker, signifying a severe form of genome instability. Herein, we show that KSHV infection does cause DNA strand breaks. Moreover, we describe a novel molecular mechanism for genome instability involving the KSHV ORF57 protein interacting with the mRNA export complex, hTREX. We demonstrate that over-expression of ORF57 results in the formation of RNA:DNA hybrids, or R-loops, that lead to an increase in genome instability. DNA strand breaks have been previously reported in herpes simplex, cytomegalovirus and Epstein-Barr virus infected cells. Therefore, as this work describes for the first time the mechanism of R-loop induced genome instability involving a conserved herpesvirus protein, it may have far-reaching implications for other viral RNA export factors.
PMCID: PMC4006916  PMID: 24788796
2.  Sister chromatid cohesion defects are associated with chromosome instability in Hodgkin lymphoma cells 
BMC Cancer  2013;13:391.
Chromosome instability manifests as an abnormal chromosome complement and is a pathogenic event in cancer. Although a correlation between abnormal chromosome numbers and cancer exists, the underlying mechanisms that cause chromosome instability are poorly understood. Recent data suggest that aberrant sister chromatid cohesion causes chromosome instability and thus contributes to the development of cancer. Cohesion normally functions by tethering nascently synthesized chromatids together to prevent premature segregation and thus chromosome instability. Although the prevalence of aberrant cohesion has been reported for some solid tumors, its prevalence within liquid tumors is unknown. Consequently, the current study was undertaken to evaluate aberrant cohesion within Hodgkin lymphoma, a lymphoid malignancy that frequently exhibits chromosome instability.
Using established cytogenetic techniques, the prevalence of chromosome instability and aberrant cohesion was examined within mitotic spreads generated from five commonly employed Hodgkin lymphoma cell lines (L-1236, KM-H2, L-428, L-540 and HDLM-2) and a lymphocyte control. Indirect immunofluorescence and Western blot analyses were performed to evaluate the localization and expression of six critical proteins involved in the regulation of sister chromatid cohesion.
We first confirmed that all five Hodgkin lymphoma cell lines exhibited chromosome instability relative to the lymphocyte control. We then determined that each Hodgkin lymphoma cell line exhibited cohesion defects that were subsequently classified into mild, moderate or severe categories. Surprisingly, ~50% of the mitotic spreads generated from L-540 and HDLM-2 harbored cohesion defects. To gain mechanistic insight into the underlying cause of the aberrant cohesion we examined the localization and expression of six critical proteins involved in cohesion. Although all proteins produced the expected nuclear localization pattern, striking differences in RAD21 expression were observed: RAD21 expression was lowest in L-540 and highest within HDLM-2.
We conclude that aberrant cohesion is a common feature of all five Hodgkin lymphoma cell lines evaluated. We further conclude that aberrant RAD21 expression is a strong candidate to underlie aberrant cohesion and chromosome instability and to contribute to the development of the disease. Our findings support a growing body of evidence suggesting that cohesion defects and aberrant RAD21 expression are pathogenic events that contribute to tumor development.
PMCID: PMC3751861  PMID: 23962039
Hodgkin lymphoma; Chromosome instability; Sister chromatid cohesion; HDLM-2; L-540; RAD21
3.  Structural and functional protein network analyses predict novel signaling functions for rhodopsin 
Proteomic analyses, literature mining, and structural data were combined to generate an extensive signaling network linked to the visual G protein-coupled receptor rhodopsin. Network analysis suggests novel signaling routes to cytoskeleton dynamics and vesicular trafficking.
Using a shotgun proteomic approach, we identified the protein inventory of the light sensing outer segment of the mammalian photoreceptor. These data, combined with literature mining, structural modeling, and computational analysis, offer a comprehensive view of signal transduction downstream of the visual G protein-coupled receptor rhodopsin. The network suggests novel signaling branches downstream of rhodopsin to cytoskeleton dynamics and vesicular trafficking. The network serves as a basis for elucidating physiological principles of photoreceptor function and suggests potential disease-associated proteins.
Photoreceptor cells are neurons capable of converting light into electrical signals. The rod outer segment (ROS) region of the photoreceptor cells is a cellular structure made of a stack of around 800 closed membrane disks loaded with rhodopsin (Liang et al, 2003; Nickell et al, 2007). In disc membranes, rhodopsin arranges itself into paracrystalline dimer arrays, enabling optimal association with the heterotrimeric G protein transducin as well as additional regulatory components (Ciarkowski et al, 2005). Disruption of these highly regulated structures and processes by germline mutations is the cause of severe blinding diseases such as retinitis pigmentosa, macular degeneration, or congenital stationary night blindness (Berger et al, 2010).
Traditionally, signal transduction networks have been studied by combining biochemical and genetic experiments addressing the relations among a small number of components. More recently, high-throughput experiments using techniques such as two-hybrid screens or co-immunoprecipitation coupled to mass spectrometry have added a new level of complexity (Ito et al, 2001; Gavin et al, 2002, 2006; Ho et al, 2002; Rual et al, 2005; Stelzl et al, 2005). However, these studies do not take into account space, time, or the fact that many interactions detected for a particular protein are mutually incompatible. Structural information can help discriminate between direct and indirect interactions and, more importantly, it can determine whether two or more predicted partners of any given protein or complex can bind a target simultaneously or instead compete for the same interaction surface (Kim et al, 2006).
In this work, we build a functional and dynamic interaction network centered on rhodopsin on a systems level, using six steps: In step 1, we experimentally identified the proteomic inventory of the porcine ROS, and we compared our data set with a recent proteomic study from bovine ROS (Kwok et al, 2008). The union of the two data sets was defined as the ‘initial experimental ROS proteome’. After removal of contaminants and applying filtering methods, a ‘core ROS proteome’, consisting of 355 proteins, was defined.
In step 2, proteins of the core ROS proteome were assigned to six functional modules: (1) vision, signaling, transporters, and channels; (2) outer segment structure and morphogenesis; (3) housekeeping; (4) cytoskeleton and polarity; (5) vesicles formation and trafficking, and (6) metabolism.
In step 3, a protein-protein interaction network was constructed based on literature mining. Since the experimental evidence for most of the interactions was co-immunoprecipitation or pull-down experiments, and many of the edges in the network are supported by a single piece of experimental evidence, often derived from high-throughput approaches, we refer to this network as the ‘fuzzy ROS interactome’. Structural information was used to predict binary interactions, based on the finding that similar domain pairs are likely to interact in a similar way (‘nature repeats itself’) (Aloy and Russell, 2002). To increase the confidence in the resulting network, edges supported by a single piece of evidence not coming from yeast two-hybrid experiments were removed, the exception being interactions where the evidence was the existence of a three-dimensional structure of the complex itself, or of a highly homologous complex. This curated static network (‘high-confidence ROS interactome’) comprises 660 edges linking the majority of the nodes. By considering only edges supported by at least one piece of evidence of direct binary interaction, we end up with a ‘high-confidence binary ROS interactome’. We next extended the published core pathway (Dell'Orco et al, 2009) using evidence from our high-confidence network. We find several new direct binary links to different cellular functional processes (Figure 4): the active rhodopsin interacts with Rac1 and the GTP form of Rho. There is also a connection between active rhodopsin and Arf4, as well as PDEδ with Rab13 and the GTP-bound form of Arl3 that links the vision cycle to vesicle trafficking and structure. We see a connection between PDEδ with prenyl-modified proteins, such as several small GTPases, as well as with rhodopsin kinase. Further, our network reveals several direct binary connections between Ca2+-regulated proteins and cytoskeleton proteins; these are CaMK2A with actinin, calmodulin with GAP43 and S100B, and PKC with 14-3-3 family members.
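The edge-filtering rule described in step 3 can be illustrated with a minimal sketch. The evidence labels, the `Edge` type, the placeholder protein names P1 to P4, and the choice of which methods count as direct binary evidence are all assumptions made for illustration; the summary above names the methods but does not define a vocabulary or data model.

```python
from dataclasses import dataclass

# Hypothetical evidence labels (not taken from the source).
STRUCTURAL = {"3d_structure", "homologous_3d_structure"}
# A single piece of evidence is enough only if it is yeast two-hybrid
# or a structure of the complex (or of a highly homologous complex).
SINGLE_OK = {"yeast_two_hybrid"} | STRUCTURAL
# Assumption: these same methods are treated as direct binary evidence.
DIRECT_BINARY = SINGLE_OK


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    evidence: tuple  # one method label per supporting experiment


def high_confidence(edges):
    """Drop edges backed by one piece of evidence unless it is in SINGLE_OK."""
    kept = []
    for e in edges:
        if len(e.evidence) >= 2:
            kept.append(e)
        elif len(e.evidence) == 1 and e.evidence[0] in SINGLE_OK:
            kept.append(e)
    return kept


def binary_only(edges):
    """Keep edges with at least one piece of direct binary evidence."""
    return [e for e in edges if any(m in DIRECT_BINARY for m in e.evidence)]


# Toy edges between placeholder proteins.
edges = [
    Edge("P1", "P2", ("co_ip",)),               # single co-IP: removed
    Edge("P1", "P3", ("yeast_two_hybrid",)),    # single Y2H: kept
    Edge("P2", "P3", ("co_ip", "pull_down")),   # two pieces of evidence: kept
    Edge("P3", "P4", ("3d_structure",)),        # structure of complex: kept
]
hc = high_confidence(edges)
hc_binary = binary_only(hc)
```

In this toy input the co-IP-only edge is dropped from the high-confidence set, and the edge supported only by co-IP and pull-down is additionally dropped from the binary set.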
In step 4, part of the network was experimentally validated using three different approaches to identify physical protein associations that would occur under physiological conditions: (i) Co-segregation/co-sedimentation experiments, (ii) immunoprecipitations combined with mass spectrometry and/or subsequent immunoblotting, and (iii) utilizing the glycosylated N-terminus of rhodopsin to isolate its associated protein partners by Concanavalin A affinity purification. In total, 60 co-purification and co-elution experiments supported interactions that were already in our literature network, and new evidence from 175 co-IP experiments in this work was added. Next, we aimed to provide additional independent experimental confirmation for two of the novel networks and functional links proposed based on the network analysis: (i) the proposed complex between Rac1, RhoA, CRMP-2, tubulin, and ROCK II in ROS was investigated by culturing retinal explants in the presence of a ROCK II-specific inhibitor (Figure 6). While morphology of the retinas treated with ROCK II inhibitor appeared normal, immunohistochemistry analyses revealed several alterations on the protein level. (ii) We supported the hypothesis that PDEδ could function as a GDI for Rac1 in ROS, by demonstrating that PDEδ and Rac1 co-localize in ROS and that PDEδ could dissociate Rac1 from ROS membranes in vitro.
In step 5, we use structural information to distinguish between mutually compatible (‘AND’) or excluded (‘XOR’) interactions. This enables breaking a network of nodes and edges into functional machines or sub-networks/modules. In the vision branch, both ‘AND’ and ‘XOR’ gates synergize. This may allow dynamic tuning of light and dark states. However, all connections from the vision module to other modules are ‘XOR’ connections suggesting that competition, in connection with local protein concentration changes, could be important for transmitting signals from the core vision module.
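As a rough illustration of the AND/XOR distinction in step 5, the sketch below labels each pair of partners of one hub protein by whether they use overlapping interaction surfaces. The function name, the surface labels, and the partner names are hypothetical; the actual analysis relies on structural data that is not described in this summary.

```python
from itertools import combinations


def classify_partner_pairs(binding_surfaces):
    """Label each pair of partners of one hub protein.

    binding_surfaces maps a partner name to the set of surface labels it
    occupies on the hub. Partners sharing any surface cannot bind at the
    same time ("XOR"); partners on disjoint surfaces can ("AND").
    """
    labels = {}
    for p, q in combinations(sorted(binding_surfaces), 2):
        shared = binding_surfaces[p] & binding_surfaces[q]
        labels[(p, q)] = "XOR" if shared else "AND"
    return labels


# Toy example with placeholder partners and surfaces.
surfaces = {
    "partner_A": {"surface_1"},
    "partner_B": {"surface_1", "surface_2"},
    "partner_C": {"surface_3"},
}
pair_labels = classify_partner_pairs(surfaces)
```

Here `partner_A` and `partner_B` compete for `surface_1` and are labelled XOR, while `partner_C` is compatible with both.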
In the last step, we map and functionally characterize the known mutations that produce blindness.
In summary, this represents the first comprehensive, dynamic, and integrative rhodopsin signaling network, which can be the basis for integrating and mapping newly discovered disease mutants, to guide protein or signaling branch-specific therapies.
Orchestration of signaling, photoreceptor structural integrity, and maintenance needed for mammalian vision remain enigmatic. By integrating three proteomic data sets, literature mining, computational analyses, and structural information, we have generated a multiscale signal transduction network linked to the visual G protein-coupled receptor (GPCR) rhodopsin, the major protein component of rod outer segments. This network was complemented by domain decomposition of protein–protein interactions and then qualified for mutually exclusive or mutually compatible interactions and ternary complex formation using structural data. The resulting information not only offers a comprehensive view of signal transduction induced by this GPCR but also suggests novel signaling routes to cytoskeleton dynamics and vesicular trafficking, predicting an important level of regulation through small GTPases. Further, it demonstrates a specific disease susceptibility of the core visual pathway due to the uniqueness of its components present mainly in the eye. As a comprehensive multiscale network, it can serve as a basis to elucidate the physiological principles of photoreceptor function, identify potential disease-associated genes and proteins, and guide the development of therapies that target specific branches of the signaling pathway.
PMCID: PMC3261702  PMID: 22108793
protein interaction network; rhodopsin signaling; structural modeling
4.  A novel approach to investigate tissue-specific trinucleotide repeat instability 
BMC Systems Biology  2010;4:29.
In Huntington's disease (HD), an expanded CAG repeat produces characteristic striatal neurodegeneration. Interestingly, the HD CAG repeat, whose length determines age at onset, undergoes tissue-specific somatic instability, predominant in the striatum, suggesting that tissue-specific CAG length changes could modify the disease process. Therefore, understanding the mechanisms underlying the tissue specificity of somatic instability may provide novel routes to therapies. However, progress in this area has been hampered by the lack of sensitive high-throughput instability quantification methods and global approaches to identify the underlying factors.
Here we describe a novel approach to gain insight into the factors responsible for the tissue specificity of somatic instability. Using accurate genetic knock-in mouse models of HD, we developed a reliable, high-throughput method to quantify tissue HD CAG repeat instability and integrated this with genome-wide bioinformatic approaches. Using tissue instability quantified in 16 tissues as a phenotype and tissue microarray gene expression as a predictor, we built a mathematical model and identified a gene expression signature that accurately predicted tissue instability. Using the predictive ability of this signature we found that somatic instability was not a consequence of pathogenesis. In support of this, genetic crosses with models of accelerated neuropathology failed to induce somatic instability. In addition, we searched for genes and pathways that correlated with tissue instability. We found that expression levels of DNA repair genes did not explain the tissue specificity of somatic instability. Instead, our data implicate other pathways, particularly cell cycle, metabolism and neurotransmitter pathways, acting in combination to generate tissue-specific patterns of instability.
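The abstract does not specify the mathematical model used to derive the expression signature. Purely as an illustration of the general idea (tissue instability as the phenotype, per-tissue gene expression as the predictor), the sketch below ranks genes by their Pearson correlation with instability across tissues. The gene names, the toy values, and the 0.8 cut-off are assumptions, not values from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_signature(expression, instability, min_abs_r=0.8):
    """Return {gene: r} for genes whose |r| with instability passes the cut-off."""
    signature = {}
    for gene, values in expression.items():
        r = pearson(values, instability)
        if abs(r) >= min_abs_r:
            signature[gene] = r
    return signature


# Toy data: one instability value and one expression value per tissue.
instability = [0.1, 0.5, 0.9, 0.3]
expression = {
    "gene_up": [1.0, 5.0, 9.0, 3.0],    # rises with instability
    "gene_down": [9.0, 5.0, 1.0, 7.0],  # falls with instability
    "gene_flat": [2.0, 2.0, 2.0, 2.0],  # uninformative
    "gene_noise": [1.0, 9.0, 1.0, 9.0], # weakly related
}
signature = select_signature(expression, instability)
```

A real analysis over 16 tissues and thousands of genes would also need cross-validation and multiple-testing control, which this sketch omits.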
Our study clearly demonstrates that multiple tissue factors reflect the level of somatic instability in different tissues. In addition, our quantitative, genome-wide approach is readily applicable to high-throughput assays and opens the door to widespread applications with the potential to accelerate the discovery of drugs that alter tissue instability.
PMCID: PMC2856555  PMID: 20302627
5.  Karyotypic Determinants of Chromosome Instability in Aneuploid Budding Yeast 
PLoS Genetics  2012;8(5):e1002719.
Recent studies in cancer cells and budding yeast demonstrated that aneuploidy, the state of having abnormal chromosome numbers, correlates with elevated chromosome instability (CIN), i.e. the propensity of gaining and losing chromosomes at a high frequency. Here we have investigated ploidy- and chromosome-specific determinants underlying aneuploidy-induced CIN by observing karyotype dynamics in fully isogenic aneuploid yeast strains with ploidies between 1N and 2N obtained through a random meiotic process. The aneuploid strains exhibited various levels of whole-chromosome instability (i.e. chromosome gains and losses). CIN correlates with cellular ploidy in an unexpected way: cells with a chromosomal content close to the haploid state are significantly more stable than cells displaying an apparent ploidy between 1.5 and 2N. We propose that the capacity for accurate chromosome segregation by the mitotic system does not scale continuously with an increasing number of chromosomes, but may occur via discrete steps each time a full set of chromosomes is added to the genome. On top of such general ploidy-related effect, CIN is also associated with the presence of specific aneuploid chromosomes as well as dosage imbalance between specific chromosome pairs. Our findings potentially help reconcile the divide between gene-centric versus genome-centric theories in cancer evolution.
Author Summary
Aneuploidy, the state of harboring an unbalanced number of chromosomes, has long been hypothesized to be at the basis of malignant transformation. Recent studies have also shown that aneuploidy is an important form of genome alteration underlying adaptive evolution of cells in response to harsh environments or genetic perturbations. In addition to the profound effect that aneuploidy has on gene expression and phenotype, another feature thought to contribute to aneuploidy's role in cancer and cellular evolution is the heightened chromosome instability of aneuploid cells. Since chromosome instability is the condition of gaining and losing chromosomes at a high frequency, this could lead to a vicious cycle in which aneuploidy could lead to further enhanced genetic diversity. Given the ever-changing and heterogeneous aneuploid cell populations, and the difficulty of separating the effect of aneuploidy from other types of genetic aberrations, the molecular mechanisms underlying aneuploidy-driven chromosome instability have remained largely unexplored. Here we describe the first unbiased and systematic investigation of chromosome instability associated with aneuploid genomes in the budding yeast Saccharomyces cerevisiae. Our results revealed both genome-level and chromosome-specific determinants of chromosome instability in aneuploid yeast. Our findings potentially help explain the molecular mechanism underlying a major source of genome instability in cancer.
PMCID: PMC3355078  PMID: 22615582
6.  Assessing the Significance of Conserved Genomic Aberrations Using High Resolution Genomic Microarrays 
PLoS Genetics  2007;3(8):e143.
Genomic aberrations recurrent in a particular cancer type can be important prognostic markers for tumor progression. Typically in early tumorigenesis, cells incur a breakdown of the DNA replication machinery that results in an accumulation of genomic aberrations in the form of duplications, deletions, translocations, and other genomic alterations. Microarray methods allow for finer mapping of these aberrations than has previously been possible; however, data processing and analysis methods have not taken full advantage of this higher resolution. Attention has primarily been given to analysis on the single sample level, where multiple adjacent probes are necessarily used as replicates for the local region containing their target sequences. However, regions of concordant aberration can be short enough to be detected by only one, or very few, array elements. We describe a method called Multiple Sample Analysis for assessing the significance of concordant genomic aberrations across multiple experiments that does not require a-priori definition of aberration calls for each sample. If there are multiple samples, representing a class, then by exploiting the replication across samples our method can detect concordant aberrations at much higher resolution than can be derived from current single sample approaches. Additionally, this method provides a meaningful approach to addressing population-based questions such as determining important regions for a cancer subtype of interest or determining regions of copy number variation in a population. Multiple Sample Analysis also provides single sample aberration calls in the locations of significant concordance, producing high resolution calls per sample, in concordant regions. 
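The idea of scoring concordance across samples without first calling aberrations in each sample can be sketched with a simple permutation test. This is not the authors' Multiple Sample Analysis algorithm, whose details are not given here; the statistic (mean log-ratio per probe), the null model (shuffling probe positions independently within each sample), and all names and values are assumptions chosen for illustration, and the sketch only tests for gains.

```python
import random


def concordance_p_values(samples, n_perm=200, seed=0):
    """Permutation p-value per probe for a concordant gain across samples.

    samples: list of equal-length lists of log-ratios, one list per sample.
    The observed per-probe mean is compared with the maximum per-probe mean
    obtained after shuffling probe positions within each sample.
    """
    rng = random.Random(seed)
    n_probes = len(samples[0])
    n_samples = len(samples)
    observed = [sum(s[j] for s in samples) / n_samples for j in range(n_probes)]

    null_max = []
    for _ in range(n_perm):
        shuffled = [rng.sample(s, n_probes) for s in samples]
        means = [sum(s[j] for s in shuffled) / n_samples for j in range(n_probes)]
        null_max.append(max(means))

    # Add-one correction keeps p-values strictly positive.
    return [
        (1 + sum(m >= obs for m in null_max)) / (1 + n_perm)
        for obs in observed
    ]


# Toy data: 6 samples, 20 probes, a single-probe gain shared by all samples.
toy = [[0.0] * 20 for _ in range(6)]
for sample in toy:
    sample[7] = 1.0
p_values = concordance_p_values(toy)
```

Because the signal in the toy data sits on one probe in every sample, a single-sample method that pools adjacent probes would have nothing to pool, whereas the cross-sample statistic singles out probe 7.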
The approach is demonstrated on a dataset representing a challenging but important resource: breast tumors that have been formalin-fixed, paraffin-embedded, archived, and subsequently UV-laser capture microdissected and hybridized to two-channel BAC arrays using an amplification protocol. We demonstrate the accurate detection on simulated data, and on real datasets involving known regions of aberration within subtypes of breast cancer at a resolution consistent with that of the array. Similarly, we apply our method to previously published datasets, including a 250K SNP array, and verify known results as well as detect novel regions of concordant aberration. The algorithm has been fully implemented and tested and is freely available as a Java application at
Author Summary
Cancer is a genetic disease caused by genomic mutations that confer an increased ability to proliferate and survive in a specific environment. It is now known that many regions of genomic DNA are deleted or amplified in specific cancer types. These aberrations are believed to occur randomly in the genome. If these aberrations overlap more than would be expected by chance across individual occurrences of the cancer this suggests a selective pressure on this aberration. These conserved aberrations likely represent regions that are important for the development, progression, and survival of a specific cancer type in its environment. We present a method for identifying these conserved aberrations within a class of samples. The applications for this method include accurate high resolution mapping of aberrations characteristic of cancer subtypes as well as other genetic diseases and determination of conserved copy number variations in the population. With the use of high resolution microarray methods we have profiled different tumor types. We have been able to create high resolution profiles of conserved aberrations in specific cancer types. These conserved aberrations are prime targets for cancer therapies and many of these regions have already been used to develop effective cancer therapeutics.
PMCID: PMC1950957  PMID: 17722985
7.  Network modeling of the transcriptional effects of copy number aberrations in glioblastoma 
DNA copy number aberrations (CNAs) are a characteristic feature of cancer genomes. In this work, Rebecka Jörnsten, Sven Nelander and colleagues combine network modeling and experimental methods to analyze the systems-level effects of CNAs in glioblastoma.
We introduce a modeling approach termed EPoC (Endogenous Perturbation analysis of Cancer), enabling the construction of global, gene-level models that causally connect gene copy number with expression in glioblastoma. On the basis of the resulting model, we predict genes that are likely to be disease-driving and validate selected predictions experimentally. We also demonstrate that further analysis of the network model by sparse singular value decomposition allows stratification of patients with glioblastoma into short-term and long-term survivors, introducing decomposed network models as a useful principle for biomarker discovery. Finally, in systematic comparisons, we demonstrate that EPoC is computationally efficient and yields more consistent results than mRNA-only methods, standard eQTL methods, and two recent multivariate methods for genotype–mRNA coupling.
Gains and losses of chromosomal material (DNA copy number aberrations; CNAs) are a characteristic feature of cancer genomes. At the level of a single locus, it is well known that increased copy number (gene amplification) typically leads to increased gene expression, whereas decreased copy number (gene deletion) leads to decreased gene expression (Pollack et al, 2002; Lee et al, 2008; Nilsson et al, 2008). However, CNAs also affect the expression of genes located outside the amplified/deleted region itself via indirect mechanisms. To fully understand the action of CNAs, it is therefore necessary to analyze their action in a network context. Toward this goal, improved computational approaches will be important, if not essential.
To determine the global effects on transcription of CNAs in the brain tumor glioblastoma, we develop EPoC (Endogenous Perturbation analysis of Cancer), a computational technique capable of inferring sparse, causal network models by combining genome-wide, paired CNA- and mRNA-level data. EPoC aims to detect disease-driving copy number aberrations and their effect on target mRNA expression, and stratify patients into long-term and short-term survivors. Technically, EPoC relates CNA perturbations to mRNA responses by matrix equations, derived from a steady-state approximation of the transcriptional network. Patient prognostic scores are obtained from singular value decompositions of the network matrix. The models are constructed by solving a large-scale, regularized regression problem.
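The steady-state argument can be written out in a generic linear form. The following is an illustrative sketch rather than the paper's exact formulation: the symbols $y$ (mRNA levels), $u$ (CNA perturbations), $A$ (network matrix), and $G$ (its steady-state response matrix) are assumed notation, and the projection used for the score is one plausible reading of "obtained from singular value decompositions of the network matrix".

```latex
% Linearized transcriptional dynamics with CNA as an external perturbation
\frac{\mathrm{d}y}{\mathrm{d}t} = A\,y + u

% Steady state: the time derivative vanishes
0 = A\,y + u
\quad\Longrightarrow\quad
y = -A^{-1}u = G\,u, \qquad G := -A^{-1}

% Singular value decomposition of the (sparse) network matrix
G = U\,\Sigma\,V^{\top}

% Assumed form of a per-patient score: projection of the patient's CNA
% profile u_p onto the k-th right singular vector v_k
s_{p,k} = v_k^{\top} u_p
```

Estimating a sparse $A$ or $G$ from paired CNA and mRNA data then becomes a regularized regression problem, consistent with the description above.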
We apply EPoC to glioblastoma data from The Cancer Genome Atlas (TCGA) consortium (186 patients). The identified CNA-driven network comprises 10 672 genes, and contains a number of copy number-altered genes that control multiple downstream genes. Highly connected hub genes include well-known oncogenes and tumor suppressor genes that are frequently deleted or amplified in glioblastoma, including EGFR, PDGFRA, CDKN2A and CDKN2B, confirming a clear association between these aberrations and transcriptional variability of these brain tumors. In addition, we identify a number of hub genes that have previously not been associated with glioblastoma, including interferon alpha 1 (IFNA1), myeloid/lymphoid or mixed-lineage leukemia translocated to 10 (MLLT10, a well-known leukemia gene), glutamate decarboxylase 2 (GAD2), a postulated glutamate receptor (GPR158) and Necdin (NDN). Furthermore, we demonstrate that the network model contains useful information on downstream target genes (including stem cell regulators), and possible drug targets.
We proceed to explore the validity of a small network region experimentally. Introducing experimental perturbations of NDN and other targets in four glioblastoma cell lines (T98G, U-87MG, U-343MG and U-373MG), we confirm several predicted mechanisms. We also demonstrate that the TCGA glioblastoma patients can be stratified into long-term and short-term survivors, using our proposed prognostic scores derived from a singular value decomposition of the network model. Finally, we compare EPoC to existing methods for mRNA network analysis and expression quantitative trait locus (eQTL) methods, and demonstrate that EPoC produces more consistent models between technically independent glioblastoma data sets, and that the EPoC models exhibit better overlap with known protein–protein interaction networks and pathway maps.
In summary, we conclude that large-scale integrative modeling reveals mechanistically and prognostically informative networks in human glioblastoma. Our approach operates at the gene level and our data support that individual hub genes can be identified in practice. Very large aberrations, however, cannot be fully resolved by the current modeling strategy.
DNA copy number aberrations (CNAs) are a hallmark of cancer genomes. However, little is known about how such changes affect global gene expression. We develop a modeling framework, EPoC (Endogenous Perturbation analysis of Cancer), to (1) detect disease-driving CNAs and their effect on target mRNA expression, and to (2) stratify cancer patients into long- and short-term survivors. Our method constructs causal network models of gene expression by combining genome-wide DNA- and RNA-level data. Prognostic scores are obtained from a singular value decomposition of the networks. By applying EPoC to glioblastoma data from The Cancer Genome Atlas consortium, we demonstrate that the resulting network models contain known disease-relevant hub genes, reveal interesting candidate hubs, and uncover predictors of patient survival. Targeted validations in four glioblastoma cell lines support selected predictions, and implicate the p53-interacting protein Necdin in suppressing glioblastoma cell growth. We conclude that large-scale network modeling of the effects of CNAs on gene expression may provide insights into the biology of human cancer. Free software in MATLAB and R is provided.
PMCID: PMC3101951  PMID: 21525872
cancer biology; cancer genomics; glioblastoma
8.  Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data 
BMC Bioinformatics  2007;8:67.
Designing appropriate machine learning methods for identifying genes that have a significant discriminating power for disease outcomes has become more and more important for our understanding of diseases at genomic level. Although many machine learning methods have been developed and applied to the area of microarray gene expression data analysis, the majority of them are based on linear models, which however are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear model based methods usually also bring in false positive significant features more easily. Furthermore, linear model based algorithms often involve calculating the inverse of a matrix that is possibly singular when the number of potentially important genes is relatively large. This leads to problems of numerical instability. To overcome these limitations, a few non-linear methods have recently been introduced to the area. Many of the existing non-linear methods have a couple of critical problems, the model selection problem and the model parameter tuning problem, that remain unsolved or even untouched. In general, a unified framework that allows model parameters of both linear and non-linear models to be easily tuned is always preferred in real-world applications. Kernel-induced learning methods form a class of approaches that show promising potentials to achieve this goal.
A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound not only in the case with a linear Bayesian classifier but also in the case with a very non-linear Bayesian classifier. This sheds light on its broader usability for microarray data analysis problems, especially those that linear methods handle awkwardly. The KIGP was also applied to four published microarray datasets, and the results showed that the KIGP performed better than or at least as well as any of the referenced state-of-the-art methods in all of these cases.
Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly non-linear underlying relationship between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates the model parameter tuning into the framework. The model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
PMCID: PMC1821044  PMID: 17328811
9.  Transcriptome instability in colorectal cancer identified by exon microarray analyses: Associations with splicing factor expression levels and patient survival 
Genome Medicine  2011;3(5):32.
Colorectal cancer (CRC) is a heterogeneous disease that, on the molecular level, can be characterized by inherent genomic instabilities: chromosome instability and microsatellite instability. In the present study we analyze genome-wide disruption of pre-mRNA splicing, and propose transcriptome instability as a characteristic that is analogous to genomic instability on the transcriptome level.
Exon microarray profiles from two independent series including a total of 160 CRCs were investigated for their relative amounts of exon usage differences. Each exon in each sample was assigned an alternative splicing score calculated by the FIRMA algorithm. Amounts of deviating exon usage per sample were derived from exons with extreme splicing scores.
There was great heterogeneity within both series in terms of sample-wise amounts of deviating exon usage. This was strongly associated with the expression levels of approximately half of 280 splicing factors (54% and 48% of splicing factors were significantly correlated to deviating exon usage amounts in the two series). Samples with high or low amounts of deviating exon usage, associated with overall transcriptome instability, were almost completely separated into their respective groups by hierarchical clustering analysis of splicing factor expression levels in both sample series. Samples showing a preferential tendency towards deviating exon skipping or inclusion were associated with skewed transcriptome instability. There were significant associations between transcriptome instability and reduced patient survival in both sample series. In the test series, patients with skewed transcriptome instability showed the strongest prognostic association (P = 0.001), while a combination of the two characteristics showed the strongest association with poor survival in the validation series (P = 0.03).
We have described transcriptome instability as a characteristic of CRC. This transcriptome instability has associations with splicing factor expression levels and poor patient survival.
PMCID: PMC3219073  PMID: 21619627
10.  Micro-Scale Genomic DNA Copy Number Aberrations as Another Means of Mutagenesis in Breast Cancer 
PLoS ONE  2012;7(12):e51719.
In breast cancer, the basal-like subtype has high levels of genomic instability relative to other breast cancer subtypes, with many basal-like-specific regions of aberration. There is evidence that this genomic instability extends to smaller scale genomic aberrations, as shown by a previously described micro-deletion event in the PTEN gene in the basal-like SUM149 breast cancer cell line.
We sought to identify if small regions of genomic DNA copy number changes exist by using a high density, gene-centric Comparative Genomic Hybridizations (CGH) array on cell lines and primary tumors. A custom tiling array for CGH (244,000 probes, 200 bp tiling resolution) was created to identify small regions of genomic change, which was focused on previously identified basal-like-specific, and general cancer genes. Tumor genomic DNA from 94 patients and 2 breast cancer cell lines was labeled and hybridized to these arrays. Aberrations were called using SWITCHdna and the smallest 25% of SWITCHdna-defined genomic segments were called micro-aberrations (<64 contiguous probes, ∼ 15 kb).
Our data showed that primary tumor breast cancer genomes frequently contained many small-scale copy number gains and losses, termed micro-aberrations, most of which are undetectable using typical-density genome-wide aCGH arrays. The basal-like subtype exhibited the highest incidence of these events. These micro-aberrations sometimes altered expression of the involved gene. We confirmed the presence of the PTEN micro-amplification in SUM149 and by mRNA-seq showed that this resulted in loss of expression of all exons downstream of this event. Micro-aberrations disproportionately affected the 5′ regions of the affected genes, including the promoter region, and high frequency of micro-aberrations was associated with poor survival.
Using a high-probe-density, gene-centric aCGH microarray, we present evidence of small-scale genomic aberrations that can contribute to gene inactivation. These events may contribute to tumor formation through mechanisms not detected using conventional DNA copy number analyses.
PMCID: PMC3524128  PMID: 23284754
11.  Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields 
Bioinformatics  2008;25(10):1307-1313.
Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.
Results: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2677736  PMID: 19052061
12.  A sampling framework for incorporating quantitative mass spectrometry data in protein interaction analysis 
BMC Bioinformatics  2013;14:299.
Comprehensive protein-protein interaction (PPI) maps are a powerful resource for uncovering the molecular basis of genetic interactions and providing mechanistic insights. Over the past decade, high-throughput experimental techniques have been developed to generate PPI maps at proteome scale, first using yeast two-hybrid approaches and more recently via affinity purification combined with mass spectrometry (AP-MS). Unfortunately, data from both protocols are prone to both high false positive and false negative rates. To address these issues, many methods have been developed to post-process raw PPI data. However, with few exceptions, these methods only analyze binary experimental data (in which each potential interaction tested is deemed either observed or unobserved), neglecting quantitative information available from AP-MS such as spectral counts.
We propose a novel method for incorporating quantitative information from AP-MS data into existing PPI inference methods that analyze binary interaction data. Our approach introduces a probabilistic framework that models the statistical noise inherent in observations of co-purifications. Using a sampling-based approach, we model the uncertainty of interactions with low spectral counts by generating an ensemble of possible alternative experimental outcomes. We then apply the existing method of choice to each alternative outcome and aggregate results over the ensemble. We validate our approach on three recent AP-MS data sets and demonstrate performance comparable to or better than state-of-the-art methods. Additionally, we provide an in-depth discussion comparing the theoretical bases of existing approaches and identify common aspects that may be key to their performance.
Our sampling framework extends the existing body of work on PPI analysis using binary interaction data to apply to the richer quantitative data now commonly available through AP-MS assays. This framework is quite general, and many enhancements are likely possible. Fruitful future directions may include investigating more sophisticated schemes for converting spectral counts to probabilities and applying the framework to direct protein complex prediction methods.
PMCID: PMC3851523  PMID: 24093595
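The sampling idea described in entry 12 (generate an ensemble of alternative binary outcomes, run a binary-data method on each, aggregate over the ensemble) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the count-to-probability map `observation_probability`, its `scale` parameter, and the stand-in `toy_binary_method` are all hypothetical choices made only for the example.

```python
import math
import random
from collections import Counter

def observation_probability(spectral_count, scale=2.0):
    """Hypothetical count-to-probability map (an assumption, not from the paper):
    0 for a zero count, saturating toward 1 for high counts."""
    return 1.0 - math.exp(-spectral_count / scale)

def sample_binary_outcome(counts, rng):
    """Draw one alternative experiment: keep each pair with its probability."""
    return {pair for pair, c in counts.items()
            if rng.random() < observation_probability(c)}

def toy_binary_method(observed_pairs):
    """Stand-in for an existing binary-data method; it accepts every observed pair."""
    return set(observed_pairs)

def ensemble_scores(counts, n_samples=1000, seed=0):
    """Apply the binary method to each sampled outcome and average the calls."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_samples):
        tally.update(toy_binary_method(sample_binary_outcome(counts, rng)))
    return {pair: tally[pair] / n_samples for pair in counts}

# Illustrative spectral counts for three bait-prey pairs (made-up numbers).
counts = {("A", "B"): 25, ("A", "C"): 1, ("B", "D"): 0}
scores = ensemble_scores(counts)
```

In this sketch a pair with a high spectral count is kept in nearly every sampled outcome, while a pair with a single spectrum is kept in only a fraction of them, so its aggregated score reflects that uncertainty.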
13.  PSCC: Sensitive and Reliable Population-Scale Copy Number Variation Detection Method Based on Low Coverage Sequencing 
PLoS ONE  2014;9(1):e85096.
Copy number variations (CNVs) represent an important type of genetic variation that deeply impact phenotypic polymorphisms and human diseases. The advent of high-throughput sequencing technologies provides an opportunity to revolutionize the discovery of CNVs and to explore their relationship with diseases. However, most of the existing methods depend on sequencing depth and show instability with low sequence coverage. In this study, using low coverage whole-genome sequencing (LCS) we have developed an effective population-scale CNV calling (PSCC) method.
Methodology/Principal Findings
In our novel method, a two-step correction was used to remove biases caused by local GC content and complex genomic characteristics. We chose a binary segmentation method to locate CNV segments and designed combined statistical tests to ensure stable control of false positives. The simulation data showed that our PSCC method could achieve 99.7%/100% and 98.6%/100% sensitivity and specificity for calling CNVs over 300 kb under LCS (∼2×) and ultra LCS (∼0.2×), respectively. Finally, we applied this novel method to analyze 34 clinical samples with an average of 2× LCS. In the final results, all 31 pathogenic CNVs identified by aCGH were successfully detected. In addition, the performance comparison revealed that our method had significant advantages over existing methods using ultra LCS.
Our study showed that PSCC can sensitively and reliably detect CNVs using low coverage or even ultra-low coverage data through population-scale sequencing.
PMCID: PMC3897425  PMID: 24465483
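Entry 13 mentions a binary segmentation step for locating CNV segments. The sketch below shows generic recursive binary segmentation on per-bin log2 depth ratios; it is not the PSCC code. It omits the GC correction, and the `min_gain` threshold is a hypothetical stopping rule standing in for the combined statistical tests the abstract describes.

```python
from statistics import mean

def best_split(values):
    """Return (index, gain) for the split with the largest mean shift.
    gain is the reduction in the sum of squares from splitting at index."""
    n = len(values)
    best_i, best_gain = None, 0.0
    for i in range(1, n):
        left, right = values[:i], values[i:]
        gain = (len(left) * len(right) / n) * (mean(left) - mean(right)) ** 2
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain

def binary_segmentation(values, start=0, min_gain=1.0):
    """Split recursively until no split gains at least min_gain.
    Returns a list of (start, end, segment_mean) with end exclusive."""
    i, gain = best_split(values)
    if i is None or gain < min_gain:
        return [(start, start + len(values), mean(values))]
    return (binary_segmentation(values[:i], start, min_gain)
            + binary_segmentation(values[i:], start + i, min_gain))

# Made-up profile: 25 bins with a single-copy loss (log2 ratio -1) in bins 10-14.
profile = [0.0] * 10 + [-1.0] * 5 + [0.0] * 10
segments = binary_segmentation(profile, min_gain=0.5)
```

On this noise-free toy profile the recursion recovers three segments with breakpoints at bins 10 and 15; real depth data would need a noise-aware significance test in place of the fixed threshold.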
14.  Group sparse canonical correlation analysis for genomic data integration 
BMC Bioinformatics  2013;14:245.
The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly less than that of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome such difficulty, mostly using penalizations with the l-1 norm (CCA-l1) or the combination of l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effect within genomic data in the analysis, which often exists and is important (e.g., SNPs spanning a gene interact and work together as a group).
We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulation and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
The CCA-sparse group method incorporates group effects of features into the correlation analysis while performing individual feature selection simultaneously. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with true correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer true correlated features while CCA-group tends to select more redundant features.
PMCID: PMC3751310  PMID: 23937249
Group sparse CCA; Genomic data integration; Feature selection; SNP
15.  High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer 
Genome Biology  2007;8(10):R215.
High resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer, and provides a genome-wide list of common copy number alterations associated with aberrant expression and poor prognosis.
The characterization of copy number alteration patterns in breast cancer requires high-resolution genome-wide profiling of a large panel of tumor specimens. To date, most genome-wide array comparative genomic hybridization studies have used tumor panels of relatively large tumor size and high Nottingham Prognostic Index (NPI) that are not as representative of breast cancer demographics.
We performed an oligo-array-based high-resolution analysis of copy number alterations in 171 primary breast tumors of relatively small size and low NPI, which was therefore more representative of breast cancer demographics. Hierarchical clustering over the common regions of alteration identified a novel subtype of high-grade estrogen receptor (ER)-negative breast cancer, characterized by a low genomic instability index. We were able to validate the existence of this genomic subtype in one external breast cancer cohort. Using matched array expression data we also identified the genomic regions showing the strongest coordinate expression changes ('hotspots'). We show that several of these hotspots are located in the phosphatome, kinome and chromatinome, and harbor members of the 122-breast cancer CAN-list. Furthermore, we identify frequently amplified hotspots on 8q22.3 (EDD1, WDSOF1), 8q24.11-13 (THRAP6, DCC1, SQLE, SPG8) and 11q14.1 (NDUFC2, ALG8, USP35) associated with significantly worse prognosis. Amplification of any of these regions identified 37 samples with significantly worse overall survival (hazard ratio (HR) = 2.3 (1.3-1.4) p = 0.003) and time to distant metastasis (HR = 2.6 (1.4-5.1) p = 0.004) independently of NPI.
We present strong evidence for the existence of a novel subtype of high-grade ER-negative tumors that is characterized by a low genomic instability index. We also provide a genome-wide list of common copy number alteration regions in breast cancer that show strong coordinate aberrant expression, and further identify novel frequently amplified regions that correlate with poor prognosis. Many of the genes associated with these regions represent likely novel oncogenes or tumor suppressors.
PMCID: PMC2246289  PMID: 17925008
16.  Multiclass relevance units machine: benchmark evaluation and application to small ncRNA discovery 
BMC Genomics  2013;14(Suppl 2):S6.
Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue.
In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).
The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. Then a set of experiments on a larger scale dataset for small ncRNA classification have been conducted with Naïve McRUM and compared with the Gaussian and linear SVM. Although McRUM's predictive performance is slightly lower than the Gaussian SVM, the results show that the similar level of true positive rate can be achieved by sacrificing false positive rate slightly. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.
We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.
PMCID: PMC3582431  PMID: 23445533
17.  Multi-locus Association Testing with Penalized Regression 
Genetic Epidemiology  2011;35(8):755-765.
In multi-locus association analysis, since some markers may not be associated with a trait, it seems attractive to use penalized regression with the capability of automatic variable selection. On the other hand, in spite of a rapidly growing body of literature on penalized regression, most focus on variable selection and outcome prediction, for which penalized methods are generally more effective than their non-penalized counterparts. However, for statistical inference, i.e. hypothesis testing and interval estimation, it is less clear how penalized methods would perform, or even how to best apply them, largely due to lack of studies on this topic. In our motivating data for a cohort of kidney transplant recipients, it is of primary interest to assess whether a group of genetic variants are associated with a binary clinical outcome, acute rejection at 6 months. In this paper, we study some technical issues and alternative implementations of hypothesis testing in Lasso penalized logistic regression, and compare their performance with each other and with several existing global tests, some of which are specifically designed as variance component tests for high-dimensional data. The most interesting, and perhaps surprising, conclusion of this study is that, for low to moderately high-dimensional data, statistical tests based on Lasso penalized regression are not necessarily more powerful than some existing global tests. In addition, in penalized regression, rather than building a test based on a single selected “best” model, combining multiple tests, each of which is built on a candidate model, might be more promising.
PMCID: PMC3350336  PMID: 21922539
Lasso; Logistic kernel machine regression; Logistic regression; Random-effects model; Score test; Sum of squared score (SSU) test
18.  Transcription Elongation and Tissue-Specific Somatic CAG Instability 
PLoS Genetics  2012;8(11):e1003051.
The expansion of CAG/CTG repeats is responsible for many diseases, including Huntington's disease (HD) and myotonic dystrophy 1. CAG/CTG expansions are unstable in selective somatic tissues, which accelerates disease progression. The mechanisms underlying repeat instability are complex, and it remains unclear whether chromatin structure and/or transcription contribute to somatic CAG/CTG instability in vivo. To address these issues, we investigated the relationship between CAG instability, chromatin structure, and transcription at the HD locus using the R6/1 and R6/2 HD transgenic mouse lines. These mice express a similar transgene, albeit integrated at a different site, and recapitulate HD tissue-specific instability. We show that instability rates are increased in R6/2 tissues as compared to R6/1 matched-samples. High transgene expression levels and chromatin accessibility correlated with the increased CAG instability of R6/2 mice. Transgene mRNA and H3K4 trimethylation at the HD locus were increased, whereas H3K9 dimethylation was reduced in R6/2 tissues relative to R6/1 matched-tissues. However, the levels of transgene expression and these specific histone marks were similar in the striatum and cerebellum, two tissues showing very different CAG instability levels, irrespective of mouse line. Interestingly, the levels of elongating RNA Pol II at the HD locus, but not the initiating form of RNA Pol II, were tissue-specific and correlated with CAG instability levels. Similarly, H3K36 trimethylation, a mark associated with transcription elongation, was specifically increased at the HD locus in the striatum and not in the cerebellum. Together, our data support the view that transcription modulates somatic CAG instability in vivo. More specifically, our results suggest for the first time that transcription elongation is regulated in a tissue-dependent manner, contributing to tissue-selective CAG instability.
Author Summary
Several dominant genetic diseases, including Huntington's disease (HD) and myotonic dystrophy 1, are caused by the expansion of CAG/CTG repeats. These repeats are unstable in selective tissues. Repeat instability, resulting in the production of increasingly toxic mutant entities in the affected tissues, is proposed to accelerate disease progression. It is therefore essential to unravel the mechanisms contributing to tissue-selective somatic instability. In vitro and cell-based studies indicate that transcription is involved in CAG/CTG instability. However, the role of transcription in CAG/CTG instability in vivo has remained controversial, since mRNA tissue levels of CAG/CTG repeat-containing genes do not correlate with the tissue-specific pattern of instability. Moreover, it is unclear whether the transcriptional process would contribute per se to CAG/CTG instability or whether it is the increased chromatin accessibility associated with transcription that would promote instability. We addressed these issues using two HD transgenic mouse lines recapitulating HD tissue-specific instability. Our in vivo data indicate that high chromatin accessibility and transgene expression do not underlie tissue-selective CAG instability in HD, but suggest that the dynamics of transcription elongation is one mechanism contributing to this process.
PMCID: PMC3510035  PMID: 23209427
19.  Allele-Specific Amplification in Cancer Revealed by SNP Array Analysis 
PLoS Computational Biology  2005;1(6):e65.
Amplification, deletion, and loss of heterozygosity of genomic DNA are hallmarks of cancer. In recent years a variety of studies have emerged measuring total chromosomal copy number at increasingly high resolution. Similarly, loss-of-heterozygosity events have been finely mapped using high-throughput genotyping technologies. We have developed a probe-level allele-specific quantitation procedure that extracts both copy number and allelotype information from single nucleotide polymorphism (SNP) array data to arrive at allele-specific copy number across the genome. Our approach applies an expectation-maximization algorithm to a model derived from a novel classification of SNP array probes. This method is the first to our knowledge that is able to (a) determine the generalized genotype of aberrant samples at each SNP site (e.g., CCCCT at an amplified site), and (b) infer the copy number of each parental chromosome across the genome. With this method, we are able to determine not just where amplifications and deletions occur, but also the haplotype of the region being amplified or deleted. The merit of our model and general approach is demonstrated by very precise genotyping of normal samples, and our allele-specific copy number inferences are validated using PCR experiments. Applying our method to a collection of lung cancer samples, we are able to conclude that amplification is essentially monoallelic, as would be expected under the mechanisms currently believed responsible for gene amplification. This suggests that a specific parental chromosome may be targeted for amplification, whether because of germ line or somatic variation. An R software package containing the methods described in this paper is freely available at
Human cancer is driven by the acquisition of genomic alterations. These alterations include amplifications and deletions of portions of one or both chromosomes in the cell. The localization of such copy number changes is an important pursuit in cancer genomics research because amplifications frequently harbor cancer-causing oncogenes, while deleted regions often contain tumor-suppressor genes. In this paper the authors present an expectation-maximization-based procedure that, when applied to data from single nucleotide polymorphism arrays, estimates not only total copy number at high resolution across the genome, but also the contribution of each parental chromosome to copy number. Applying this approach to data from over 100 lung cancer samples the authors find that, in essentially all cases, amplification is monoallelic. That is, only one of the two parental chromosomes contributes to the copy number elevation in each amplified region. This phenomenon makes possible the identification of haplotypes, or patterns of single nucleotide polymorphism alleles, that may serve as markers for the tumor-inducing genetic variants being targeted.
PMCID: PMC1289392  PMID: 16322765
20.  SNP Selection in Genome-Wide Association Studies via Penalized Support Vector Machine with MAX Test 
One of the main objectives of a genome-wide association study (GWAS) is to develop a prediction model for a binary clinical outcome using single-nucleotide polymorphisms (SNPs), which can be used for diagnostic and prognostic purposes and for better understanding of the relationship between the disease and SNPs. Penalized support vector machine (SVM) methods have been widely used toward this end. However, since investigators often ignore the genetic models of SNPs, the final model loses efficiency in predicting the clinical outcome. In order to overcome this problem, we propose a two-stage method such that the genetic model of each SNP is identified using the MAX test and then a prediction model is fitted using a penalized SVM method. We apply the proposed method to various penalized SVMs and compare the performance of SVMs using various penalty functions. The results from simulations and real GWAS data analysis show that the proposed method performs better than the prediction methods ignoring the genetic models in terms of prediction power and selectivity.
PMCID: PMC3794570  PMID: 24174989
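The first stage described in entry 20 picks a genetic model per SNP with the MAX test. Assuming the MAX test takes its usual form (the largest absolute Cochran-Armitage trend statistic over recessive, additive and dominant genotype scores), the model-selection step can be sketched as below. The function names and the example counts are illustrative, and the sketch computes only the statistic, not its p-value, which requires the joint distribution of the three statistics.

```python
import math

# Genotype scores for (aa, Aa, AA) under three genetic models.
MODEL_SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}

def trend_statistic(cases, controls, scores):
    """Cochran-Armitage trend Z statistic for 2 x 3 genotype counts.
    Assumes both groups are non-empty and the scores vary across genotypes."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    var = (R * S / N) * (
        N * sum(x * x * n for x, n in zip(scores, totals))
        - sum(x * n for x, n in zip(scores, totals)) ** 2
    )
    return u / math.sqrt(var)

def max_test(cases, controls):
    """Return (best_model, max |Z|, all Z statistics) for one SNP."""
    stats = {m: trend_statistic(cases, controls, x)
             for m, x in MODEL_SCORES.items()}
    best = max(stats, key=lambda m: abs(stats[m]))
    return best, abs(stats[best]), stats

# Made-up genotype counts (aa, Aa, AA) for one SNP.
model, z_max, z_all = max_test(cases=(10, 20, 30), controls=(30, 20, 10))
```

For these counts the additive coding gives the largest statistic, so the SNP would enter the second-stage penalized SVM with additive genotype scores.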
21.  DNA Damage, Somatic Aneuploidy, and Malignant Sarcoma Susceptibility in Muscular Dystrophies 
PLoS Genetics  2011;7(4):e1002042.
Albeit genetically highly heterogeneous, muscular dystrophies (MDs) share a convergent pathology leading to muscle wasting accompanied by proliferation of fibrous and fatty tissue, suggesting a common MD–pathomechanism. Here we show that mutations in muscular dystrophy genes (Dmd, Dysf, Capn3, Large) lead to the spontaneous formation of skeletal muscle-derived malignant tumors in mice, presenting as mixed rhabdomyo-, fibro-, and liposarcomas. Primary MD–gene defects and strain background strongly influence sarcoma incidence, latency, localization, and gender prevalence. Combined loss of dystrophin and dysferlin, as well as dystrophin and calpain-3, leads to accelerated tumor formation. Irrespective of the primary gene defects, all MD sarcomas share non-random genomic alterations including frequent losses of tumor suppressors (Cdkn2a, Nf1), amplification of oncogenes (Met, Jun), recurrent duplications of whole chromosomes 8 and 15, and DNA damage. Remarkably, these sarcoma-specific genetic lesions are already regularly present in skeletal muscles in aged MD mice even prior to sarcoma development. Accordingly, we show also that skeletal muscle from human muscular dystrophy patients is affected by gross genomic instability, represented by DNA double-strand breaks and age-related accumulation of aneusomies. These novel aspects of molecular pathologies common to muscular dystrophies and tumor biology will potentially influence the strategies to combat these diseases.
Author Summary
All kinds of muscular dystrophies (MDs) are characterized by progressive muscle wasting due to life-long proliferation of precursor cells of myo- (muscle), fibro- (connective tissue), and lipogenic (fat) origin. Despite discovery of many MD genes over the past 25 years, MDs still represent debilitating, incurable diseases, which frequently lead to premature death. Thus, it is imperative to gain novel insights into the underlying MD pathomechanisms. Here, we show that different mouse models for the most common human MDs frequently develop skeletal musculature-associated tumors, presenting as complex sarcomas, consisting of myo-, lipo-, and fibrogenic compartments. Collectively, these tumors are characterized by profound genomic instability such as DNA damage, recurring mutations in cancer genes, and aberrant chromosome copy numbers. We also demonstrate the presence of these cancer-related aberrations in dystrophic muscles from MD mice prior to formation of visible sarcomas. Moreover, we discovered corresponding genomic lesions also in skeletal muscles from human MD patients, as well as stem cells cultured thereof, and show that genomic instability precedes muscle degeneration in MDs. We thus propose that cancer-like genomic instability represents a novel, unifying pathomechanism underlying the entire group of genetically distinct MDs, which will hopefully open new therapeutic avenues.
PMCID: PMC3077392  PMID: 21533183
22.  μ-CS: An extension of the TM4 platform to manage Affymetrix binary data 
BMC Bioinformatics  2010;11:315.
A main goal in understanding cell mechanisms is to explain the relationship among genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. There exist different kinds of microarray platforms that produce different types of binary data (images and raw data). Moreover, even for a single vendor, different chips are available. The analysis of microarray data requires an initial preprocessing phase (i.e. normalization and summarization) of raw data that makes them suitable for use on existing platforms, such as the TIGR M4 Suite. Nevertheless, annotating the data with additional information, such as gene function, is needed to perform more powerful analyses. Raw data preprocessing and annotation are often performed manually and in an error-prone way. Moreover, many available preprocessing tools do not support annotation. Thus novel, platform-independent, and possibly open source tools enabling the semi-automatic preprocessing and annotation of microarray data are needed.
The paper presents μ-CS (Microarray Cel file Summarizer), a cross-platform tool for the automatic normalization, summarization and annotation of Affymetrix binary data. μ-CS is based on a client-server architecture. The μ-CS client is provided both as a plug-in of the TIGR M4 platform and as a Java standalone tool and enables users to read, preprocess and analyse binary microarray data, avoiding the manual invocation of external tools (e.g. the Affymetrix Power Tools), the manual loading of preprocessing libraries, and the management of intermediate files. The μ-CS server automatically updates the references to the summarization and annotation libraries that are provided to the μ-CS client before the preprocessing. The μ-CS server is based on the web services technology and can be easily extended to support more microarray vendors (e.g. Illumina).
Thus μ-CS users can directly manage binary data without worrying about locating and invoking the proper preprocessing tools and chip-specific libraries. Moreover, users of the μ-CS plugin for TM4 can manage Affymetrix binary files without using external tools, such as APT (Affymetrix Power Tools) and related libraries. Consequently, μ-CS offers four main advantages: (i) it avoids wasting time searching for the correct libraries, (ii) it reduces possible errors in the preprocessing and further analysis phases, e.g. due to the incorrect choice of parameters or the use of old libraries, (iii) it implements the annotation of preprocessed data, and finally, (iv) it may enhance the quality of further analysis since it provides the most up-to-date annotation libraries. The μ-CS client is freely available as a plugin of the TM4 platform as well as a standalone application at the project web site (
PMCID: PMC2907348  PMID: 20537149
23.  Gene-based interaction analysis by incorporating external linkage disequilibrium information 
Gene–gene interactions have an important role in complex human diseases. Detection of gene–gene interactions has long been a challenge due to their complexity. The standard method aiming at detecting SNP–SNP interactions may be inadequate as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to a large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied to a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
PMCID: PMC3025792  PMID: 20924406
gene–gene interaction; linkage disequilibrium; imputation
24.  A Novel Model to Combine Clinical and Pathway-Based Transcriptomic Information for the Prognosis Prediction of Breast Cancer 
PLoS Computational Biology  2014;10(9):e1003851.
Breast cancer is the most common malignancy in women worldwide. With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed for more personalized treatment and disease management. Towards this goal, we have developed a novel computational model for breast cancer prognosis by combining the Pathway Deregulation Score (PDS) based pathifier algorithm, Cox regression and L1-LASSO penalization method. We trained the model on a set of 236 patients with gene expression data and clinical information, and validated the performance on three diversified testing data sets of 606 patients. To evaluate the performance of the model, we conducted survival analysis of the dichotomized groups, and compared the areas under the curve based on the binary classification. The resulting prognosis genomic model is composed of fifteen pathways (e.g. P53 pathway) that had previously reported cancer relevance, and it successfully differentiated relapse in the training set (log rank p-value = 6.25e-12) and three testing data sets (log rank p-value<0.0005). Moreover, the pathway-based genomic models consistently performed better than gene-based models on all four data sets. We also find strong evidence that combining genomic information with clinical information improved the p-values of prognosis prediction by at least three orders of magnitude in comparison to using either genomic or clinical information alone. In summary, we propose a novel prognosis model that harnesses the pathway-based dysregulation as well as valuable clinical information. The selected pathways in our prognosis model are promising targets for therapeutic intervention.
Author Summary
With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed early on for more personalized treatment and management. Towards this goal we propose in this study a novel pathway-based prognosis prediction model, which emphasizes individualized pathway-based risk measurement using the pathway dysregulation score (PDS). In combination with the L1-LASSO penalized feature selection and the Cox proportional hazards regression model, we have identified fifteen cancer-relevant pathways using the pathway-based genomic model that successfully differentiated relapse in the training set as well as in three diversified test sets. Moreover, given the debate over whether higher-order representative features, such as GO sets, pathways and network modules, are superior to gene-level features in genomic models, we demonstrate that pathway-based genomic models consistently performed better than gene-based models in all four data sets. Last but not least, we show strong evidence that models that combine genomic information with clinical information improve the prognosis prediction significantly, in comparison to models that use either genomic or clinical information alone.
PMCID: PMC4168973  PMID: 25233347
25.  Impact of chromosomal instability on colorectal cancer progression and outcome 
BMC Cancer  2014;14:121.
It remains unclear whether disease progression in colorectal carcinoma (CRC), from early to invasive and metastatic forms, is associated with a gradual increase in genetic instability and with a scheme of sequentially occurring Copy Number Alterations (CNAs).
In this work we set out to determine whether such links exist between CRC progression and genetic instability, and searched for associations with patient outcome. To this end we analyzed a set of 162 chromosomally unstable (CIN) CRCs comprising 131 primary carcinomas evenly distributed through stage 1 to 4, 31 metastases and 14 adenomas by array-CGH. CNA profiles were established according to disease stage and compared. We also asked whether the level of genomic instability was correlated with disease outcome in stage 2 and 3 CRCs. Two metrics of chromosomal instability were used: (i) Global Genomic Index (GGI), corresponding to the fraction of the genome involved in CNA, and (ii) number of breakpoints (nbBP).
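The two instability metrics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: a profile is assumed to be a list of `(chromosome, start, end, state)` segments with state `"gain"`, `"loss"` or `"normal"`, GGI is taken as altered length divided by total profiled length, and a breakpoint is assumed to be any change of state between consecutive segments on the same chromosome. The function names and the example data are hypothetical.

```python
from itertools import groupby


def ggi(segments):
    """Global Genomic Index: fraction of the profiled genome involved in CNA."""
    total = sum(end - start for _, start, end, _ in segments)
    altered = sum(end - start
                  for _, start, end, state in segments
                  if state != "normal")
    return altered / total if total else 0.0


def nb_bp(segments):
    """Number of breakpoints, counted as state changes within a chromosome.

    Assumption: the source does not define how breakpoints are called;
    this counting rule is only one plausible reading.
    """
    ordered = sorted(segments, key=lambda s: (s[0], s[1]))
    count = 0
    for _, group in groupby(ordered, key=lambda s: s[0]):
        states = [s[3] for s in group]
        count += sum(1 for a, b in zip(states, states[1:]) if a != b)
    return count


# Hypothetical toy profile, coordinates in arbitrary units.
example_segments = [
    ("1", 0, 100, "normal"),
    ("1", 100, 150, "gain"),
    ("1", 150, 200, "normal"),
    ("2", 0, 100, "loss"),
]
```

For the toy profile, half of the profiled length is altered, and the focal gain on chromosome 1 contributes two breakpoints while the whole-chromosome loss contributes none, which mirrors the distinction the abstract draws between fractional and whole-arm CNAs.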
Stage 1, 2, 3 and 4 tumors did not differ significantly at the level of their CNA profiles, precluding the conventional definition of a progression scheme based on increasing levels of genetic instability. Combining GGI and nbBP, we classified genomic profiles into 5 groups presenting distinct patterns of chromosomal instability and defined two risk classes of tumors, showing strong differences in outcome and hazard risk (RFS: p = 0.012, HR = 3; OS: p < 0.001, HR = 9.7). While tumors of the high risk group were characterized by frequent fractional CNAs, low risk tumors presented predominantly whole chromosomal arm CNAs. Searching for CNAs correlating with negative outcome, we found that losses at 16p13.3 and 19q13.3, observed in 10% (7/72) of stage 2–3 tumors, showed strong association with early relapse (p < 0.001) and death (p < 0.007, p < 0.016). Both events showed frequent co-occurrence (p < 1 × 10⁻⁸) and could therefore serve as markers for stage 2–3 CRC susceptible to negative outcome.
Our data show that CRC disease progression from stage 1 to stage 4 is not paralleled by increased levels of genetic instability. However, they suggest that stage 2–3 CRC with elevated genetic instability and particularly profiles with fractional CNA represent a subset of aggressive tumors.
PMCID: PMC4233623  PMID: 24559140
Colorectal cancer; Genomic instability; Breakpoint; Array CGH; CIN tumors; Adenoma; Primary tumors; Metastasis; Outcome; 16p13.3; 19q13.3
