Related Articles

1.  A Novel Mechanism Inducing Genome Instability in Kaposi's Sarcoma-Associated Herpesvirus Infected Cells 
PLoS Pathogens  2014;10(5):e1004098.
Kaposi's sarcoma-associated herpesvirus (KSHV) is an oncogenic herpesvirus associated with multiple AIDS-related malignancies. Like other herpesviruses, KSHV has a biphasic life cycle and both the lytic and latent phases are required for tumorigenesis. Evidence suggests that KSHV lytic replication can cause genome instability in KSHV-infected cells, although no mechanism has thus far been described. A surprising link has recently been suggested between mRNA export, genome instability and cancer development. Notably, aberrations in the cellular transcription and export complex (hTREX) proteins have been identified in high-grade tumours and these defects contribute to genome instability. We have previously shown that the lytically expressed KSHV ORF57 protein interacts with the complete hTREX complex; therefore, we investigated the possible intriguing link between ORF57, hTREX and KSHV-induced genome instability. Herein, we show that lytically active KSHV infected cells induce a DNA damage response and, importantly, we demonstrate directly that this is due to DNA strand breaks. Furthermore, we show that sequestration of the hTREX complex by the KSHV ORF57 protein leads to this double strand break response and significant DNA damage. Moreover, we describe a novel mechanism showing that the genetic instability observed is a consequence of R-loop formation. Importantly, the link between hTREX sequestration and DNA damage may be a common feature in herpesvirus infection, as a similar phenotype was observed with the herpes simplex virus 1 (HSV-1) ICP27 protein. Our data provide a model of R-loop induced DNA damage in KSHV infected cells and describe a novel system for studying genome instability caused by aberrant hTREX.
Author Summary
The hallmarks of cancer comprise the essential elements that permit the formation and development of human tumours. Genome instability is an enabling characteristic that allows the progression of tumorigenesis through genetic mutation and therefore, understanding the molecular causes of genome instability in all cancers is essential for development of therapeutics. The Kaposi's sarcoma-associated herpesvirus (KSHV) is an important human pathogen that causes multiple AIDS-related cancers. Recent studies have shown that during KSHV infection, cells show an increase in a double-strand DNA break marker, signifying a severe form of genome instability. Herein, we show that KSHV infection does cause DNA strand breaks. Moreover, we describe a novel molecular mechanism for genome instability involving the KSHV ORF57 protein interacting with the mRNA export complex, hTREX. We demonstrate that over-expression of ORF57 results in the formation of RNA:DNA hybrids, or R-loops, that lead to an increase in genome instability. DNA strand breaks have been previously reported in herpes simplex, cytomegalovirus and Epstein-Barr virus infected cells. Therefore, as this work describes for the first time the mechanism of R-loop induced genome instability involving a conserved herpesvirus protein, it may have far-reaching implications for other viral RNA export factors.
doi:10.1371/journal.ppat.1004098
PMCID: PMC4006916  PMID: 24788796
2.  Sister chromatid cohesion defects are associated with chromosome instability in Hodgkin lymphoma cells 
BMC Cancer  2013;13:391.
Background
Chromosome instability manifests as an abnormal chromosome complement and is a pathogenic event in cancer. Although a correlation between abnormal chromosome numbers and cancer exists, the underlying mechanisms that cause chromosome instability are poorly understood. Recent data suggest that aberrant sister chromatid cohesion causes chromosome instability and thus contributes to the development of cancer. Cohesion normally functions by tethering nascently synthesized chromatids together to prevent premature segregation and thus chromosome instability. Although the prevalence of aberrant cohesion has been reported for some solid tumors, its prevalence within liquid tumors is unknown. Consequently, the current study was undertaken to evaluate aberrant cohesion within Hodgkin lymphoma, a lymphoid malignancy that frequently exhibits chromosome instability.
Methods
Using established cytogenetic techniques, the prevalence of chromosome instability and aberrant cohesion was examined within mitotic spreads generated from five commonly employed Hodgkin lymphoma cell lines (L-1236, KM-H2, L-428, L-540 and HDLM-2) and a lymphocyte control. Indirect immunofluorescence and Western blot analyses were performed to evaluate the localization and expression of six critical proteins involved in the regulation of sister chromatid cohesion.
Results
We first confirmed that all five Hodgkin lymphoma cell lines exhibited chromosome instability relative to the lymphocyte control. We then determined that each Hodgkin lymphoma cell line exhibited cohesion defects that were subsequently classified into mild, moderate or severe categories. Surprisingly, ~50% of the mitotic spreads generated from L-540 and HDLM-2 harbored cohesion defects. To gain mechanistic insight into the underlying cause of the aberrant cohesion we examined the localization and expression of six critical proteins involved in cohesion. Although all proteins produced the expected nuclear localization pattern, striking differences in RAD21 expression were observed: RAD21 expression was lowest in L-540 and highest within HDLM-2.
Conclusion
We conclude that aberrant cohesion is a common feature of all five Hodgkin lymphoma cell lines evaluated. We further conclude that aberrant RAD21 expression is a strong candidate to underlie aberrant cohesion and chromosome instability, and to contribute to the development of the disease. Our findings support a growing body of evidence suggesting that cohesion defects and aberrant RAD21 expression are pathogenic events that contribute to tumor development.
doi:10.1186/1471-2407-13-391
PMCID: PMC3751861  PMID: 23962039
Hodgkin lymphoma; Chromosome instability; Sister chromatid cohesion; HDLM-2; L-540; RAD21
3.  A novel approach to investigate tissue-specific trinucleotide repeat instability 
BMC Systems Biology  2010;4:29.
Background
In Huntington's disease (HD), an expanded CAG repeat produces characteristic striatal neurodegeneration. Interestingly, the HD CAG repeat, whose length determines age at onset, undergoes tissue-specific somatic instability, predominant in the striatum, suggesting that tissue-specific CAG length changes could modify the disease process. Therefore, understanding the mechanisms underlying the tissue specificity of somatic instability may provide novel routes to therapies. However, progress in this area has been hampered by the lack of sensitive high-throughput instability quantification methods and global approaches to identify the underlying factors.
Results
Here we describe a novel approach to gain insight into the factors responsible for the tissue specificity of somatic instability. Using accurate genetic knock-in mouse models of HD, we developed a reliable, high-throughput method to quantify tissue HD CAG repeat instability and integrated this with genome-wide bioinformatic approaches. Using tissue instability quantified in 16 tissues as a phenotype and tissue microarray gene expression as a predictor, we built a mathematical model and identified a gene expression signature that accurately predicted tissue instability. Using the predictive ability of this signature we found that somatic instability was not a consequence of pathogenesis. In support of this, genetic crosses with models of accelerated neuropathology failed to induce somatic instability. In addition, we searched for genes and pathways that correlated with tissue instability. We found that expression levels of DNA repair genes did not explain the tissue specificity of somatic instability. Instead, our data implicate other pathways, particularly cell cycle, metabolism and neurotransmitter pathways, acting in combination to generate tissue-specific patterns of instability.
Conclusion
Our study clearly demonstrates that multiple tissue factors reflect the level of somatic instability in different tissues. In addition, our quantitative, genome-wide approach is readily applicable to high-throughput assays and opens the door to widespread applications with the potential to accelerate the discovery of drugs that alter tissue instability.
doi:10.1186/1752-0509-4-29
PMCID: PMC2856555  PMID: 20302627
4.  Assessing the Significance of Conserved Genomic Aberrations Using High Resolution Genomic Microarrays 
PLoS Genetics  2007;3(8):e143.
Genomic aberrations recurrent in a particular cancer type can be important prognostic markers for tumor progression. Typically in early tumorigenesis, cells incur a breakdown of the DNA replication machinery that results in an accumulation of genomic aberrations in the form of duplications, deletions, translocations, and other genomic alterations. Microarray methods allow for finer mapping of these aberrations than has previously been possible; however, data processing and analysis methods have not taken full advantage of this higher resolution. Attention has primarily been given to analysis on the single sample level, where multiple adjacent probes are necessarily used as replicates for the local region containing their target sequences. However, regions of concordant aberration can be short enough to be detected by only one, or very few, array elements. We describe a method called Multiple Sample Analysis for assessing the significance of concordant genomic aberrations across multiple experiments that does not require a-priori definition of aberration calls for each sample. If there are multiple samples, representing a class, then by exploiting the replication across samples our method can detect concordant aberrations at much higher resolution than can be derived from current single sample approaches. Additionally, this method provides a meaningful approach to addressing population-based questions such as determining important regions for a cancer subtype of interest or determining regions of copy number variation in a population. Multiple Sample Analysis also provides single sample aberration calls in the locations of significant concordance, producing high resolution calls per sample, in concordant regions. 
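The idea of exploiting replication across samples instead of adjacent probes can be illustrated with a small permutation test. This is only a sketch under stated assumptions, not the published Multiple Sample Analysis algorithm: the function name `concordance_pvalues`, the mean log-ratio statistic, the within-sample shuffling null model and the synthetic data are all chosen here for illustration.

```python
import numpy as np

def concordance_pvalues(log_ratios, n_perm=500, seed=0):
    """Per-probe permutation p-values for concordant aberration across samples.

    log_ratios: array of shape (n_samples, n_probes) holding array log-ratios.
    """
    rng = np.random.default_rng(seed)
    # Observed statistic: absolute mean log-ratio of each probe across samples.
    observed = np.abs(log_ratios.mean(axis=0))
    exceed = np.zeros(log_ratios.shape[1])
    for _ in range(n_perm):
        # Assumed null model: shuffle probe positions independently within each
        # sample, which destroys any alignment of aberrations between samples.
        shuffled = np.array([rng.permutation(row) for row in log_ratios])
        null_max = np.abs(shuffled.mean(axis=0)).max()
        # Comparing against the array-wide maximum accounts for multiple testing.
        exceed += null_max >= observed
    return (exceed + 1.0) / (n_perm + 1.0)

# Synthetic example: 20 samples, 200 probes, one single probe gained in every sample.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 0.3, size=(20, 200))
data[:, 50] += 0.8
pvals = concordance_pvalues(data)
print(int(np.argmin(pvals)), float(pvals.min()))
```

In this toy setting the aberration spans one array element, so a single-sample method that averages over neighbouring probes would dilute it, whereas the across-sample statistic does not need any neighbours.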
The approach is demonstrated on a dataset representing a challenging but important resource: breast tumors that have been formalin-fixed, paraffin-embedded, archived, and subsequently UV-laser capture microdissected and hybridized to two-channel BAC arrays using an amplification protocol. We demonstrate the accurate detection on simulated data, and on real datasets involving known regions of aberration within subtypes of breast cancer at a resolution consistent with that of the array. Similarly, we apply our method to previously published datasets, including a 250K SNP array, and verify known results as well as detect novel regions of concordant aberration. The algorithm has been fully implemented and tested and is freely available as a Java application at http://www.cbil.upenn.edu/MSA.
Author Summary
Cancer is a genetic disease caused by genomic mutations that confer an increased ability to proliferate and survive in a specific environment. It is now known that many regions of genomic DNA are deleted or amplified in specific cancer types. These aberrations are believed to occur randomly in the genome. If these aberrations overlap more than would be expected by chance across individual occurrences of the cancer this suggests a selective pressure on this aberration. These conserved aberrations likely represent regions that are important for the development, progression, and survival of a specific cancer type in its environment. We present a method for identifying these conserved aberrations within a class of samples. The applications for this method include accurate high resolution mapping of aberrations characteristic of cancer subtypes as well as other genetic diseases and determination of conserved copy number variations in the population. With the use of high resolution microarray methods we have profiled different tumor types. We have been able to create high resolution profiles of conserved aberrations in specific cancer types. These conserved aberrations are prime targets for cancer therapies and many of these regions have already been used to develop effective cancer therapeutics.
doi:10.1371/journal.pgen.0030143
PMCID: PMC1950957  PMID: 17722985
5.  Structural and functional protein network analyses predict novel signaling functions for rhodopsin 
Proteomic analyses, literature mining, and structural data were combined to generate an extensive signaling network linked to the visual G protein-coupled receptor rhodopsin. Network analysis suggests novel signaling routes to cytoskeleton dynamics and vesicular trafficking.
Using a shotgun proteomic approach, we identified the protein inventory of the light sensing outer segment of the mammalian photoreceptor. These data, combined with literature mining, structural modeling, and computational analysis, offer a comprehensive view of signal transduction downstream of the visual G protein-coupled receptor rhodopsin. The network suggests novel signaling branches downstream of rhodopsin to cytoskeleton dynamics and vesicular trafficking. The network serves as a basis for elucidating physiological principles of photoreceptor function and suggests potential disease-associated proteins.
Photoreceptor cells are neurons capable of converting light into electrical signals. The rod outer segment (ROS) region of the photoreceptor cells is a cellular structure made of a stack of around 800 closed membrane disks loaded with rhodopsin (Liang et al, 2003; Nickell et al, 2007). In disc membranes, rhodopsin arranges itself into paracrystalline dimer arrays, enabling optimal association with the heterotrimeric G protein transducin as well as additional regulatory components (Ciarkowski et al, 2005). Disruption of these highly regulated structures and processes by germline mutations is the cause of severe blinding diseases such as retinitis pigmentosa, macular degeneration, or congenital stationary night blindness (Berger et al, 2010).
Traditionally, signal transduction networks have been studied by combining biochemical and genetic experiments addressing the relations among a small number of components. More recently, high-throughput experiments using techniques such as two-hybrid screening or co-immunoprecipitation coupled to mass spectrometry have added a new level of complexity (Ito et al, 2001; Gavin et al, 2002, 2006; Ho et al, 2002; Rual et al, 2005; Stelzl et al, 2005). However, these studies do not take into consideration space, time, or the fact that many interactions detected for a particular protein are not compatible with one another. Structural information can help discriminate between direct and indirect interactions and, more importantly, it can determine whether two or more predicted partners of any given protein or complex can simultaneously bind a target or rather compete for the same interaction surface (Kim et al, 2006).
In this work, we build a functional and dynamic interaction network centered on rhodopsin on a systems level, using six steps: In step 1, we experimentally identified the proteomic inventory of the porcine ROS, and we compared our data set with a recent proteomic study from bovine ROS (Kwok et al, 2008). The union of the two data sets was defined as the ‘initial experimental ROS proteome’. After removal of contaminants and applying filtering methods, a ‘core ROS proteome’, consisting of 355 proteins, was defined.
In step 2, proteins of the core ROS proteome were assigned to six functional modules: (1) vision, signaling, transporters, and channels; (2) outer segment structure and morphogenesis; (3) housekeeping; (4) cytoskeleton and polarity; (5) vesicles formation and trafficking, and (6) metabolism.
In step 3, a protein-protein interaction network was constructed based on literature mining. Since the experimental evidence for most of the interactions was co-immunoprecipitation or pull-down experiments, and many of the edges in the network are supported by a single piece of experimental evidence, often derived from high-throughput approaches, we refer to this network as the ‘fuzzy ROS interactome’. Structural information was used to predict binary interactions, based on the finding that similar domain pairs are likely to interact in a similar way (‘nature repeats itself’) (Aloy and Russell, 2002). To increase the confidence in the resulting network, edges supported by a single piece of evidence not coming from yeast two-hybrid experiments were removed, the exception being interactions where the evidence was the existence of a three-dimensional structure of the complex itself, or of a highly homologous complex. This curated static network (‘high-confidence ROS interactome’) comprises 660 edges linking the majority of the nodes. By considering only edges supported by at least one piece of evidence of direct binary interaction, we end up with a ‘high-confidence binary ROS interactome’. We next extended the published core pathway (Dell'Orco et al, 2009) using evidence from our high-confidence network. We find several new direct binary links to different cellular functional processes (Figure 4): the active rhodopsin interacts with Rac1 and the GTP form of Rho. There is also a connection between active rhodopsin and Arf4, as well as PDEδ with Rab13 and the GTP-bound form of Arl3 that links the vision cycle to vesicle trafficking and structure. We see a connection between PDEδ with prenyl-modified proteins, such as several small GTPases, as well as with rhodopsin kinase. Further, our network reveals several direct binary connections between Ca2+-regulated proteins and cytoskeleton proteins; these are CaMK2A with actinin, calmodulin with GAP43 and S1008, and PKC with 14-3-3 family members.
In step 4, part of the network was experimentally validated using three different approaches to identify physical protein associations that would occur under physiological conditions: (i) co-segregation/co-sedimentation experiments, (ii) immunoprecipitations combined with mass spectrometry and/or subsequent immunoblotting, and (iii) utilizing the glycosylated N-terminus of rhodopsin to isolate its associated protein partners by Concanavalin A affinity purification. In total, 60 co-purification and co-elution experiments supported interactions that were already in our literature network, and new evidence from 175 co-IP experiments in this work was added. Next, we aimed to provide additional independent experimental confirmation for two of the novel networks and functional links proposed based on the network analysis: (i) The proposed complex between Rac1, RhoA, CRMP-2, tubulin and ROCK II in ROS was investigated by culturing retinal explants in the presence of a ROCK II-specific inhibitor (Figure 6). While morphology of the retinas treated with ROCK II inhibitor appeared normal, immunohistochemistry analyses revealed several alterations on the protein level. (ii) We supported the hypothesis that PDEδ could function as a GDI for Rac1 in ROS, by demonstrating that PDEδ and Rac1 co-localize in ROS and that PDEδ could dissociate Rac1 from ROS membranes in vitro.
In step 5, we use structural information to distinguish between mutually compatible (‘AND’) or excluded (‘XOR’) interactions. This enables breaking a network of nodes and edges into functional machines or sub-networks/modules. In the vision branch, both ‘AND’ and ‘XOR’ gates synergize. This may allow dynamic tuning of light and dark states. However, all connections from the vision module to other modules are ‘XOR’ connections suggesting that competition, in connection with local protein concentration changes, could be important for transmitting signals from the core vision module.
In the last step, we map and functionally characterize the known mutations that produce blindness.
In summary, this represents the first comprehensive, dynamic, and integrative rhodopsin signaling network, which can be the basis for integrating and mapping newly discovered disease mutants, to guide protein or signaling branch-specific therapies.
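The evidence filter of step 3 and the ‘AND’/‘XOR’ classification of step 5 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the edge list, the protein names (`hub`, `partner_1`, ...), the surface labels and the function names `high_confidence` and `gate` are all hypothetical, and real structural compatibility analysis is far more involved than comparing a surface label.

```python
from itertools import combinations

# Hypothetical edges: (protein_a, protein_b, evidence types, binding surface on protein_a).
EDGES = [
    ("hub", "partner_1", {"co-IP", "pull-down"}, "surface_1"),
    ("hub", "partner_2", {"yeast two-hybrid"}, "surface_1"),
    ("hub", "partner_3", {"3D structure"}, "surface_2"),
    ("hub", "partner_4", {"co-IP"}, "surface_2"),
]
# Evidence types that are accepted even when they are the only support for an edge.
DIRECT = {"yeast two-hybrid", "3D structure"}

def high_confidence(edges):
    """Keep edges with two or more evidence types, or a single direct one."""
    return [e for e in edges if len(e[2]) >= 2 or e[2] & DIRECT]

def gate(edge_x, edge_y):
    """Partners that use the same surface compete ('XOR'); otherwise they can co-bind ('AND')."""
    return "XOR" if edge_x[3] == edge_y[3] else "AND"

kept = high_confidence(EDGES)
gates = {(x[1], y[1]): gate(x, y) for x, y in combinations(kept, 2)}
for pair, kind in gates.items():
    print(pair, kind)
```

Here `partner_4` is dropped because its only support is a single co-IP, and `partner_1` and `partner_2` are labelled ‘XOR’ because they are assumed to share one surface on the hub.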
Orchestration of signaling, photoreceptor structural integrity, and maintenance needed for mammalian vision remain enigmatic. By integrating three proteomic data sets, literature mining, computational analyses, and structural information, we have generated a multiscale signal transduction network linked to the visual G protein-coupled receptor (GPCR) rhodopsin, the major protein component of rod outer segments. This network was complemented by domain decomposition of protein–protein interactions and then qualified for mutually exclusive or mutually compatible interactions and ternary complex formation using structural data. The resulting information not only offers a comprehensive view of signal transduction induced by this GPCR but also suggests novel signaling routes to cytoskeleton dynamics and vesicular trafficking, predicting an important level of regulation through small GTPases. Further, it demonstrates a specific disease susceptibility of the core visual pathway due to the uniqueness of its components present mainly in the eye. As a comprehensive multiscale network, it can serve as a basis to elucidate the physiological principles of photoreceptor function, identify potential disease-associated genes and proteins, and guide the development of therapies that target specific branches of the signaling pathway.
doi:10.1038/msb.2011.83
PMCID: PMC3261702  PMID: 22108793
protein interaction network; rhodopsin signaling; structural modeling
6.  Network modeling of the transcriptional effects of copy number aberrations in glioblastoma 
DNA copy number aberrations (CNAs) are a characteristic feature of cancer genomes. In this work, Rebecka Jörnsten, Sven Nelander and colleagues combine network modeling and experimental methods to analyze the systems-level effects of CNAs in glioblastoma.
We introduce a modeling approach termed EPoC (Endogenous Perturbation analysis of Cancer), enabling the construction of global, gene-level models that causally connect gene copy number with expression in glioblastoma. On the basis of the resulting model, we predict genes that are likely to be disease-driving and validate selected predictions experimentally. We also demonstrate that further analysis of the network model by sparse singular value decomposition allows stratification of patients with glioblastoma into short-term and long-term survivors, introducing decomposed network models as a useful principle for biomarker discovery. Finally, in systematic comparisons, we demonstrate that EPoC is computationally efficient and yields more consistent results than mRNA-only methods, standard eQTL methods, and two recent multivariate methods for genotype–mRNA coupling.
Gains and losses of chromosomal material (DNA copy number aberrations; CNAs) are a characteristic feature of cancer genomes. At the level of a single locus, it is well known that increased copy number (gene amplification) typically leads to increased gene expression, whereas decreased copy number (gene deletion) leads to decreased gene expression (Pollack et al, 2002; Lee et al, 2008; Nilsson et al, 2008). However, CNAs also affect the expression of genes located outside the amplified/deleted region itself via indirect mechanisms. To fully understand the action of CNAs, it is therefore necessary to analyze their action in a network context. Toward this goal, improved computational approaches will be important, if not essential.
To determine the global effects on transcription of CNAs in the brain tumor glioblastoma, we develop EPoC (Endogenous Perturbation analysis of Cancer), a computational technique capable of inferring sparse, causal network models by combining genome-wide, paired CNA- and mRNA-level data. EPoC aims to detect disease-driving copy number aberrations and their effect on target mRNA expression, and stratify patients into long-term and short-term survivors. Technically, EPoC relates CNA perturbations to mRNA responses by matrix equations, derived from a steady-state approximation of the transcriptional network. Patient prognostic scores are obtained from singular value decompositions of the network matrix. The models are constructed by solving a large-scale, regularized regression problem.
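The two computational ingredients named above, a regularized regression that yields a sparse network matrix and a singular value decomposition that yields patient scores, can be sketched as follows. This is a minimal illustration and not the EPoC implementation: the L1 (lasso) penalty, the iterative soft-thresholding solver, the function name `fit_sparse_network`, the projection used for `scores` and the synthetic data are assumptions made here, since the text does not give the exact equations.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink every entry of z towards zero by t (proximal step of the L1 penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_sparse_network(cna, mrna, lam=0.1, n_iter=500):
    """Minimise 1/(2n) * ||mrna - cna @ G||_F^2 + lam * ||G||_1 by iterative soft-thresholding.

    cna, mrna: arrays of shape (n_patients, n_genes). Entry G[i, j] is read as the
    effect of the copy number of gene i on the expression of gene j.
    """
    n = cna.shape[0]
    step = n / np.linalg.norm(cna, 2) ** 2  # inverse Lipschitz constant of the gradient
    G = np.zeros((cna.shape[1], mrna.shape[1]))
    for _ in range(n_iter):
        grad = cna.T @ (cna @ G - mrna) / n
        G = soft_threshold(G - step * grad, step * lam)
    return G

# Synthetic data with two planted CNA-to-mRNA effects (purely illustrative).
rng = np.random.default_rng(0)
n_patients, n_genes = 200, 8
G_true = np.zeros((n_genes, n_genes))
G_true[0, 3], G_true[2, 5] = 1.5, -1.0
cna = rng.normal(size=(n_patients, n_genes))
mrna = cna @ G_true + 0.1 * rng.normal(size=(n_patients, n_genes))

G_hat = fit_sparse_network(cna, mrna)
# Decompose the network matrix; projecting each patient's CNA profile onto the
# leading left singular vector gives one score per patient (an assumed scoring rule).
U, s, Vt = np.linalg.svd(G_hat, full_matrices=False)
scores = cna @ U[:, 0]
print(np.count_nonzero(G_hat), scores.shape)
```

The penalty weight `lam` trades sparsity against fit; in practice it would be chosen by cross-validation or a similar criterion instead of being fixed by hand.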
We apply EPoC to glioblastoma data from The Cancer Genome Atlas (TCGA) consortium (186 patients). The identified CNA-driven network comprises 10 672 genes, and contains a number of copy number-altered genes that control multiple downstream genes. Highly connected hub genes include well-known oncogenes and tumor suppressor genes that are frequently deleted or amplified in glioblastoma, including EGFR, PDGFRA, CDKN2A and CDKN2B, confirming a clear association between these aberrations and transcriptional variability of these brain tumors. In addition, we identify a number of hub genes that have previously not been associated with glioblastoma, including interferon alpha 1 (IFNA1), myeloid/lymphoid or mixed-lineage leukemia translocated to 10 (MLLT10, a well-known leukemia gene), glutamate decarboxylase 2 (GAD2), a postulated glutamate receptor (GPR158) and Necdin (NDN). Furthermore, we demonstrate that the network model contains useful information on downstream target genes (including stem cell regulators), and possible drug targets.
We proceed to explore the validity of a small network region experimentally. Introducing experimental perturbations of NDN and other targets in four glioblastoma cell lines (T98G, U-87MG, U-343MG and U-373MG), we confirm several predicted mechanisms. We also demonstrate that the TCGA glioblastoma patients can be stratified into long-term and short-term survivors, using our proposed prognostic scores derived from a singular value decomposition of the network model. Finally, we compare EPoC to existing methods for mRNA network analysis and expression quantitative trait locus (eQTL) methods, and demonstrate that EPoC produces more consistent models between technically independent glioblastoma data sets, and that the EPoC models exhibit better overlap with known protein–protein interaction networks and pathway maps.
In summary, we conclude that large-scale integrative modeling reveals mechanistically and prognostically informative networks in human glioblastoma. Our approach operates at the gene level and our data support that individual hub genes can be identified in practice. Very large aberrations, however, cannot be fully resolved by the current modeling strategy.
DNA copy number aberrations (CNAs) are a hallmark of cancer genomes. However, little is known about how such changes affect global gene expression. We develop a modeling framework, EPoC (Endogenous Perturbation analysis of Cancer), to (1) detect disease-driving CNAs and their effect on target mRNA expression, and to (2) stratify cancer patients into long- and short-term survivors. Our method constructs causal network models of gene expression by combining genome-wide DNA- and RNA-level data. Prognostic scores are obtained from a singular value decomposition of the networks. By applying EPoC to glioblastoma data from The Cancer Genome Atlas consortium, we demonstrate that the resulting network models contain known disease-relevant hub genes, reveal interesting candidate hubs, and uncover predictors of patient survival. Targeted validations in four glioblastoma cell lines support selected predictions, and implicate the p53-interacting protein Necdin in suppressing glioblastoma cell growth. We conclude that large-scale network modeling of the effects of CNAs on gene expression may provide insights into the biology of human cancer. Free software in MATLAB and R is provided.
doi:10.1038/msb.2011.17
PMCID: PMC3101951  PMID: 21525872
cancer biology; cancer genomics; glioblastoma
7.  Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data 
BMC Bioinformatics  2007;8:67.
Background
Designing appropriate machine learning methods for identifying genes that have significant discriminating power for disease outcomes has become increasingly important for our understanding of diseases at the genomic level. Although many machine learning methods have been developed and applied to microarray gene expression data analysis, the majority of them are based on linear models, which are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear model based methods also tend to admit false positive features more easily. Furthermore, linear model based algorithms often involve calculating the inverse of a matrix that is possibly singular when the number of potentially important genes is relatively large. This leads to problems of numerical instability. To overcome these limitations, a few non-linear methods have recently been introduced to the area. Many of the existing non-linear methods have two critical problems, the model selection problem and the model parameter tuning problem, that remain unsolved or even untouched. In general, a unified framework that allows model parameters of both linear and non-linear models to be easily tuned is preferred in real-world applications. Kernel-induced learning methods form a class of approaches that show promising potential to achieve this goal.
Results
A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound not only in the case with a linear Bayesian classifier but also in the case with a very non-linear Bayesian classifier. This sheds light on its broader usability for microarray data analysis problems, especially those for which linear methods work poorly. The KIGP was also applied to four published microarray datasets, and the results showed that the KIGP performed better than or at least as well as any of the referred state-of-the-art methods in all of these cases.
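The combination of a kernel, a probit link and a Gibbs sampler can be sketched with the standard data-augmentation scheme for Bayesian probit regression. This is not the KIGP algorithm: the sketch fixes a Gaussian kernel and its width instead of selecting them, performs no gene selection, and places an assumed independent normal prior on the kernel weights. All function names and the synthetic two-feature data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its cdf and inverse cdf

def rbf_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def sample_truncated(mu, positive, rng):
    """Draw z ~ N(mu, 1) restricted to z > 0 (positive) or z <= 0, by inverse CDF."""
    p0 = _STD.cdf(-mu)  # probability that z <= 0
    u = rng.uniform()
    p = p0 + u * (1.0 - p0) if positive else u * p0
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside its open domain
    return mu + _STD.inv_cdf(p)

def gibbs_probit(K, y, n_iter=300, burn=100, prior_var=1.0, seed=0):
    """Gibbs sampler for the model P(y=1) = Phi(K @ w) with prior w ~ N(0, prior_var * I)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    cov = np.linalg.inv(K.T @ K + np.eye(n) / prior_var)  # covariance of w given z
    chol = np.linalg.cholesky(cov)
    w = np.zeros(n)
    draws = []
    for it in range(n_iter):
        # Step 1: latent scores z given w and the labels (truncated normals).
        mu = K @ w
        z = np.array([sample_truncated(m, yi == 1, rng) for m, yi in zip(mu, y)])
        # Step 2: weights w given z (multivariate normal).
        w = cov @ (K.T @ z) + chol @ rng.standard_normal(n)
        if it >= burn:
            draws.append(w)
    return np.array(draws)

def predict_proba(K_new, draws):
    """Posterior mean class-1 probability for each row of K_new."""
    phi = np.vectorize(_STD.cdf)
    return phi(K_new @ draws.T).mean(axis=1)

# Synthetic non-linear problem: class 1 lies inside a circle around the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (np.sum(X ** 2, axis=1) < 1.386).astype(int)
K = rbf_kernel(X, X)
draws = gibbs_probit(K, y)
prob = predict_proba(K, draws)
acc = float(np.mean((prob > 0.5) == (y == 1)))
print(round(acc, 2))
```

The output of `predict_proba` is a probability for each sample, which corresponds to the Bayesian probabilistic predictions the abstract refers to; the printed value is training accuracy only and says nothing about generalisation.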
Conclusion
Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly non-linear underlying relationship between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates the model parameter tuning into the framework. The model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
doi:10.1186/1471-2105-8-67
PMCID: PMC1821044  PMID: 17328811
8.  Karyotypic Determinants of Chromosome Instability in Aneuploid Budding Yeast 
PLoS Genetics  2012;8(5):e1002719.
Recent studies in cancer cells and budding yeast demonstrated that aneuploidy, the state of having abnormal chromosome numbers, correlates with elevated chromosome instability (CIN), i.e. the propensity of gaining and losing chromosomes at a high frequency. Here we have investigated ploidy- and chromosome-specific determinants underlying aneuploidy-induced CIN by observing karyotype dynamics in fully isogenic aneuploid yeast strains with ploidies between 1N and 2N obtained through a random meiotic process. The aneuploid strains exhibited various levels of whole-chromosome instability (i.e. chromosome gains and losses). CIN correlates with cellular ploidy in an unexpected way: cells with a chromosomal content close to the haploid state are significantly more stable than cells displaying an apparent ploidy between 1.5 and 2N. We propose that the capacity for accurate chromosome segregation by the mitotic system does not scale continuously with an increasing number of chromosomes, but may occur via discrete steps each time a full set of chromosomes is added to the genome. On top of such general ploidy-related effect, CIN is also associated with the presence of specific aneuploid chromosomes as well as dosage imbalance between specific chromosome pairs. Our findings potentially help reconcile the divide between gene-centric versus genome-centric theories in cancer evolution.
Author Summary
Aneuploidy, the state of harboring an unbalanced number of chromosomes, has long been hypothesized to be at the basis of malignant transformation. Recent studies have also shown that aneuploidy is an important form of genome alteration underlying adaptive evolution of cells in response to harsh environments or genetic perturbations. In addition to the profound effect that aneuploidy has on gene expression and phenotype, another feature thought to contribute to aneuploidy's role in cancer and cellular evolution is the heightened chromosome instability of aneuploid cells. Since chromosome instability is the condition of gaining and losing chromosomes at a high frequency, this could lead to a vicious cycle in which aneuploidy could lead to further enhanced genetic diversity. Given the ever-changing and heterogeneous aneuploid cell populations, and the difficulty of separating the effect of aneuploidy from other types of genetic aberrations, the molecular mechanisms underlying aneuploidy-driven chromosome instability have remained largely unexplored. Here we describe the first unbiased and systematic investigation of chromosome instability associated with aneuploid genomes in the budding yeast Saccharomyces cerevisiae. Our results revealed both genome-level and chromosome-specific determinants of chromosome instability in aneuploid yeast. Our findings potentially help explain the molecular mechanism underlying a major source of genome instability in cancer.
doi:10.1371/journal.pgen.1002719
PMCID: PMC3355078  PMID: 22615582
9.  Transcriptome instability in colorectal cancer identified by exon microarray analyses: Associations with splicing factor expression levels and patient survival 
Genome Medicine  2011;3(5):32.
Background
Colorectal cancer (CRC) is a heterogeneous disease that, on the molecular level, can be characterized by inherent genomic instabilities: chromosome instability and microsatellite instability. In the present study we analyze genome-wide disruption of pre-mRNA splicing, and propose transcriptome instability as a characteristic that is analogous to genomic instability on the transcriptome level.
Methods
Exon microarray profiles from two independent series including a total of 160 CRCs were investigated for their relative amounts of exon usage differences. Each exon in each sample was assigned an alternative splicing score calculated by the FIRMA algorithm. Amounts of deviating exon usage per sample were derived from exons with extreme splicing scores.
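The counting step described above can be illustrated with a minimal sketch. It assumes the per-exon, per-sample splicing scores are already available as a matrix; the function name `deviating_exon_counts` and the 1st/99th percentile cut-offs are illustrative assumptions, not values taken from the study.

```python
import numpy as np

def deviating_exon_counts(scores, lower_pct=1.0, upper_pct=99.0):
    """Count exons with extreme splicing scores in each sample.

    scores: 2-D array, rows = exons, columns = samples.
    The percentile cut-offs are illustrative assumptions only.
    """
    scores = np.asarray(scores, dtype=float)
    # Thresholds come from the pooled distribution of all scores.
    lo, hi = np.percentile(scores, [lower_pct, upper_pct])
    low_counts = (scores < lo).sum(axis=0)    # extreme low scores per sample
    high_counts = (scores > hi).sum(axis=0)   # extreme high scores per sample
    return low_counts, high_counts, low_counts + high_counts
```

Keeping the low and high tails separate makes it possible to ask whether a sample's deviations are skewed towards one tail, in addition to its total amount of deviating exon usage.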
Results
There was great heterogeneity within both series in terms of sample-wise amounts of deviating exon usage. This was strongly associated with the expression levels of approximately half of 280 splicing factors (54% and 48% of splicing factors were significantly correlated to deviating exon usage amounts in the two series). Samples with high or low amounts of deviating exon usage, associated with overall transcriptome instability, were almost completely separated into their respective groups by hierarchical clustering analysis of splicing factor expression levels in both sample series. Samples showing a preferential tendency towards deviating exon skipping or inclusion were associated with skewed transcriptome instability. There were significant associations between transcriptome instability and reduced patient survival in both sample series. In the test series, patients with skewed transcriptome instability showed the strongest prognostic association (P = 0.001), while a combination of the two characteristics showed the strongest association with poor survival in the validation series (P = 0.03).
Conclusions
We have described transcriptome instability as a characteristic of CRC. This transcriptome instability has associations with splicing factor expression levels and poor patient survival.
doi:10.1186/gm248
PMCID: PMC3219073  PMID: 21619627
10.  A sampling framework for incorporating quantitative mass spectrometry data in protein interaction analysis 
BMC Bioinformatics  2013;14:299.
Background
Comprehensive protein-protein interaction (PPI) maps are a powerful resource for uncovering the molecular basis of genetic interactions and providing mechanistic insights. Over the past decade, high-throughput experimental techniques have been developed to generate PPI maps at proteome scale, first using yeast two-hybrid approaches and more recently via affinity purification combined with mass spectrometry (AP-MS). Unfortunately, data from both protocols are prone to both high false positive and false negative rates. To address these issues, many methods have been developed to post-process raw PPI data. However, with few exceptions, these methods only analyze binary experimental data (in which each potential interaction tested is deemed either observed or unobserved), neglecting quantitative information available from AP-MS such as spectral counts.
Results
We propose a novel method for incorporating quantitative information from AP-MS data into existing PPI inference methods that analyze binary interaction data. Our approach introduces a probabilistic framework that models the statistical noise inherent in observations of co-purifications. Using a sampling-based approach, we model the uncertainty of interactions with low spectral counts by generating an ensemble of possible alternative experimental outcomes. We then apply the existing method of choice to each alternative outcome and aggregate results over the ensemble. We validate our approach on three recent AP-MS data sets and demonstrate performance comparable to or better than state-of-the-art methods. Additionally, we provide an in-depth discussion comparing the theoretical bases of existing approaches and identify common aspects that may be key to their performance.
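The sample-apply-aggregate idea described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the mapping from spectral counts to probabilities (`count_to_probability`, with its `scale` parameter) and the stand-in binary method `shared_prey` are hypothetical.

```python
import numpy as np

def count_to_probability(counts, scale=2.0):
    """Map spectral counts to a probability that an observation is real.

    The saturating form 1 - exp(-count / scale) is a placeholder assumption.
    """
    return 1.0 - np.exp(-np.asarray(counts, dtype=float) / scale)

def ensemble_scores(counts, binary_method, n_samples=200, seed=0):
    """Average a binary-data method over sampled alternative outcomes."""
    rng = np.random.default_rng(seed)
    probs = count_to_probability(counts)
    total = None
    for _ in range(n_samples):
        # One alternative experimental outcome: each bait-prey observation
        # is kept with its probability, otherwise treated as unobserved.
        outcome = (rng.random(probs.shape) < probs).astype(int)
        result = np.asarray(binary_method(outcome), dtype=float)
        total = result if total is None else total + result
    return total / n_samples

def shared_prey(outcome):
    """Toy binary method (hypothetical): preys shared by each pair of baits."""
    return outcome @ outcome.T
```

Here `counts` is a baits-by-preys matrix of spectral counts. Observations with high counts appear in nearly every sampled outcome, while low-count observations appear only in some, so their contribution to the aggregated score is down-weighted rather than thresholded away.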
Conclusions
Our sampling framework extends the existing body of work on PPI analysis using binary interaction data to apply to the richer quantitative data now commonly available through AP-MS assays. This framework is quite general, and many enhancements are likely possible. Fruitful future directions may include investigating more sophisticated schemes for converting spectral counts to probabilities and applying the framework to direct protein complex prediction methods.
doi:10.1186/1471-2105-14-299
PMCID: PMC3851523  PMID: 24093595
11.  PSCC: Sensitive and Reliable Population-Scale Copy Number Variation Detection Method Based on Low Coverage Sequencing 
PLoS ONE  2014;9(1):e85096.
Background
Copy number variations (CNVs) represent an important type of genetic variation that deeply impacts phenotypic polymorphisms and human diseases. The advent of high-throughput sequencing technologies provides an opportunity to revolutionize the discovery of CNVs and to explore their relationship with diseases. However, most of the existing methods depend on sequencing depth and show instability with low sequence coverage. In this study, using low coverage whole-genome sequencing (LCS) we have developed an effective population-scale CNV calling (PSCC) method.
Methodology/Principal Findings
In our novel method, two-step correction was used to remove biases caused by local GC content and complex genomic characteristics. We chose a binary segmentation method to locate CNV segments and designed combined statistics tests to ensure the stable performance of the false positive control. The simulation data showed that our PSCC method could achieve 99.7%/100% and 98.6%/100% sensitivity and specificity for over 300 kb CNV calling in the condition of LCS (∼2×) and ultra LCS (∼0.2×), respectively. Finally, we applied this novel method to analyze 34 clinical samples with an average of 2× LCS. In the final results, all the 31 pathogenic CNVs identified by aCGH were successfully detected. In addition, the performance comparison revealed that our method had significant advantages over existing methods using ultra LCS.
Conclusions/Significance
Our study showed that PSCC can sensitively and reliably detect CNVs using low coverage or even ultra-low coverage data through population-scale sequencing.
doi:10.1371/journal.pone.0085096
PMCID: PMC3897425  PMID: 24465483
12.  DNA Damage, Somatic Aneuploidy, and Malignant Sarcoma Susceptibility in Muscular Dystrophies 
PLoS Genetics  2011;7(4):e1002042.
Albeit genetically highly heterogeneous, muscular dystrophies (MDs) share a convergent pathology leading to muscle wasting accompanied by proliferation of fibrous and fatty tissue, suggesting a common MD–pathomechanism. Here we show that mutations in muscular dystrophy genes (Dmd, Dysf, Capn3, Large) lead to the spontaneous formation of skeletal muscle-derived malignant tumors in mice, presenting as mixed rhabdomyo-, fibro-, and liposarcomas. Primary MD–gene defects and strain background strongly influence sarcoma incidence, latency, localization, and gender prevalence. Combined loss of dystrophin and dysferlin, as well as dystrophin and calpain-3, leads to accelerated tumor formation. Irrespective of the primary gene defects, all MD sarcomas share non-random genomic alterations including frequent losses of tumor suppressors (Cdkn2a, Nf1), amplification of oncogenes (Met, Jun), recurrent duplications of whole chromosomes 8 and 15, and DNA damage. Remarkably, these sarcoma-specific genetic lesions are already regularly present in skeletal muscles in aged MD mice even prior to sarcoma development. Accordingly, we show also that skeletal muscle from human muscular dystrophy patients is affected by gross genomic instability, represented by DNA double-strand breaks and age-related accumulation of aneusomies. These novel aspects of molecular pathologies common to muscular dystrophies and tumor biology will potentially influence the strategies to combat these diseases.
Author Summary
All kinds of muscular dystrophies (MDs) are characterized by progressive muscle wasting due to life-long proliferation of precursor cells of myo- (muscle), fibro- (connective tissue), and lipogenic (fat) origin. Despite discovery of many MD genes over the past 25 years, MDs still represent debilitating, incurable diseases, which frequently lead to premature death. Thus, it is imperative to gain novel insights into the underlying MD pathomechanisms. Here, we show that different mouse models for the most common human MDs frequently develop skeletal musculature-associated tumors, presenting as complex sarcomas, consisting of myo-, lipo-, and fibrogenic compartments. Collectively, these tumors are characterized by profound genomic instability such as DNA damage, recurring mutations in cancer genes, and aberrant chromosome copy numbers. We also demonstrate the presence of these cancer-related aberrations in dystrophic muscles from MD mice prior to formation of visible sarcomas. Moreover, we discovered corresponding genomic lesions also in skeletal muscles from human MD patients, as well as stem cells cultured thereof, and show that genomic instability precedes muscle degeneration in MDs. We thus propose that cancer-like genomic instability represents a novel, unifying pathomechanism underlying the entire group of genetically distinct MDs, which will hopefully open new therapeutic avenues.
doi:10.1371/journal.pgen.1002042
PMCID: PMC3077392  PMID: 21533183
13.  Allele-Specific Amplification in Cancer Revealed by SNP Array Analysis 
PLoS Computational Biology  2005;1(6):e65.
Amplification, deletion, and loss of heterozygosity of genomic DNA are hallmarks of cancer. In recent years a variety of studies have emerged measuring total chromosomal copy number at increasingly high resolution. Similarly, loss-of-heterozygosity events have been finely mapped using high-throughput genotyping technologies. We have developed a probe-level allele-specific quantitation procedure that extracts both copy number and allelotype information from single nucleotide polymorphism (SNP) array data to arrive at allele-specific copy number across the genome. Our approach applies an expectation-maximization algorithm to a model derived from a novel classification of SNP array probes. This method is the first to our knowledge that is able to (a) determine the generalized genotype of aberrant samples at each SNP site (e.g., CCCCT at an amplified site), and (b) infer the copy number of each parental chromosome across the genome. With this method, we are able to determine not just where amplifications and deletions occur, but also the haplotype of the region being amplified or deleted. The merit of our model and general approach is demonstrated by very precise genotyping of normal samples, and our allele-specific copy number inferences are validated using PCR experiments. Applying our method to a collection of lung cancer samples, we are able to conclude that amplification is essentially monoallelic, as would be expected under the mechanisms currently believed responsible for gene amplification. This suggests that a specific parental chromosome may be targeted for amplification, whether because of germ line or somatic variation. An R software package containing the methods described in this paper is freely available at http://genome.dfci.harvard.edu/~tlaframb/PLASQ.
Synopsis
Human cancer is driven by the acquisition of genomic alterations. These alterations include amplifications and deletions of portions of one or both chromosomes in the cell. The localization of such copy number changes is an important pursuit in cancer genomics research because amplifications frequently harbor cancer-causing oncogenes, while deleted regions often contain tumor-suppressor genes. In this paper the authors present an expectation-maximization-based procedure that, when applied to data from single nucleotide polymorphism arrays, estimates not only total copy number at high resolution across the genome, but also the contribution of each parental chromosome to copy number. Applying this approach to data from over 100 lung cancer samples the authors find that, in essentially all cases, amplification is monoallelic. That is, only one of the two parental chromosomes contributes to the copy number elevation in each amplified region. This phenomenon makes possible the identification of haplotypes, or patterns of single nucleotide polymorphism alleles, that may serve as markers for the tumor-inducing genetic variants being targeted.
doi:10.1371/journal.pcbi.0010065
PMCID: PMC1289392  PMID: 16322765
14.  Micro-Scale Genomic DNA Copy Number Aberrations as Another Means of Mutagenesis in Breast Cancer 
PLoS ONE  2012;7(12):e51719.
Introduction
In breast cancer, the basal-like subtype has high levels of genomic instability relative to other breast cancer subtypes with many basal-like-specific regions of aberration. There is evidence that this genomic instability extends to smaller scale genomic aberrations, as shown by a previously described micro-deletion event in the PTEN gene in the Basal-like SUM149 breast cancer cell line.
Methods
We sought to identify if small regions of genomic DNA copy number changes exist by using a high density, gene-centric Comparative Genomic Hybridizations (CGH) array on cell lines and primary tumors. A custom tiling array for CGH (244,000 probes, 200 bp tiling resolution) was created to identify small regions of genomic change, which was focused on previously identified basal-like-specific, and general cancer genes. Tumor genomic DNA from 94 patients and 2 breast cancer cell lines was labeled and hybridized to these arrays. Aberrations were called using SWITCHdna and the smallest 25% of SWITCHdna-defined genomic segments were called micro-aberrations (<64 contiguous probes, ∼ 15 kb).
Results
Our data showed that primary tumor breast cancer genomes frequently contained many small-scale copy number gains and losses, termed micro-aberrations, most of which are undetectable using typical-density genome-wide aCGH arrays. The basal-like subtype exhibited the highest incidence of these events. These micro-aberrations sometimes altered expression of the involved gene. We confirmed the presence of the PTEN micro-amplification in SUM149 and by mRNA-seq showed that this resulted in loss of expression of all exons downstream of this event. Micro-aberrations disproportionately affected the 5′ regions of the affected genes, including the promoter region, and high frequency of micro-aberrations was associated with poor survival.
Conclusion
Using a high-probe-density, gene-centric aCGH microarray, we present evidence of small-scale genomic aberrations that can contribute to gene inactivation. These events may contribute to tumor formation through mechanisms not detected using conventional DNA copy number analyses.
doi:10.1371/journal.pone.0051719
PMCID: PMC3524128  PMID: 23284754
15.  SNP Selection in Genome-Wide Association Studies via Penalized Support Vector Machine with MAX Test 
One of the main objectives of a genome-wide association study (GWAS) is to develop a prediction model for a binary clinical outcome using single-nucleotide polymorphisms (SNPs) which can be used for diagnostic and prognostic purposes and for better understanding of the relationship between the disease and SNPs. Penalized support vector machine (SVM) methods have been widely used toward this end. However, since investigators often ignore the genetic models of SNPs, the final model suffers a loss of efficiency in predicting the clinical outcome. In order to overcome this problem, we propose a two-stage method in which the genetic model of each SNP is first identified using the MAX test and a prediction model is then fitted using a penalized SVM method. We apply the proposed method to various penalized SVMs and compare the performance of SVMs using various penalty functions. The results from simulations and real GWAS data analysis show that the proposed method performs better than the prediction methods ignoring the genetic models in terms of prediction power and selectivity.
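The first stage can be illustrated with a minimal sketch of a MAX-type test. It assumes genotypes are coded as 0, 1 or 2 copies of the risk allele and uses the common form of the statistic, the largest absolute Cochran-Armitage trend statistic over the recessive, additive and dominant codings; the function names are hypothetical and the exact variant used in the paper may differ. The trend statistic is written here as sqrt(n) times the Pearson correlation between the coded genotype and case status, which matches the usual form up to the choice of variance normaliser.

```python
import numpy as np

# Genotype scores (0, 1 or 2 risk alleles) under each genetic model.
MODEL_SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}

def trend_z(genotypes, status, scores):
    """Trend statistic as sqrt(n) * Pearson correlation (assumed form)."""
    x = np.take(scores, genotypes)          # recode genotypes under one model
    y = np.asarray(status, dtype=float)     # 1 = case, 0 = control
    if x.std() == 0 or y.std() == 0:
        return 0.0  # no variation, so no evidence of association
    return np.sqrt(len(y)) * np.corrcoef(x, y)[0, 1]

def max_test(genotypes, status):
    """Return the MAX statistic and the genetic model that attains it."""
    z = {m: abs(trend_z(genotypes, status, s)) for m, s in MODEL_SCORES.items()}
    best = max(z, key=z.get)
    return z[best], best
```

In a two-stage scheme of this kind, each SNP would then be recoded with the scores of its selected model before being passed to the penalized SVM; that second stage is not shown here.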
doi:10.1155/2013/340678
PMCID: PMC3794570  PMID: 24174989
16.  A Novel Model to Combine Clinical and Pathway-Based Transcriptomic Information for the Prognosis Prediction of Breast Cancer 
PLoS Computational Biology  2014;10(9):e1003851.
Breast cancer is the most common malignancy in women worldwide. With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed for more personalized treatment and disease management. Towards this goal, we have developed a novel computational model for breast cancer prognosis by combining the Pathway Deregulation Score (PDS) based pathifier algorithm, Cox regression and L1-LASSO penalization method. We trained the model on a set of 236 patients with gene expression data and clinical information, and validated the performance on three diversified testing data sets of 606 patients. To evaluate the performance of the model, we conducted survival analysis of the dichotomized groups, and compared the areas under the curve based on the binary classification. The resulting prognosis genomic model is composed of fifteen pathways (e.g. P53 pathway) that had previously reported cancer relevance, and it successfully differentiated relapse in the training set (log rank p-value = 6.25e-12) and three testing data sets (log rank p-value<0.0005). Moreover, the pathway-based genomic models consistently performed better than gene-based models on all four data sets. We also find strong evidence that combining genomic information with clinical information improved the p-values of prognosis prediction by at least three orders of magnitude in comparison to using either genomic or clinical information alone. In summary, we propose a novel prognosis model that harnesses the pathway-based dysregulation as well as valuable clinical information. The selected pathways in our prognosis model are promising targets for therapeutic intervention.
Author Summary
With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed early on for more personalized treatment and management. Towards this goal we propose in this study a novel pathway-based prognosis prediction model, which emphasizes individualized pathway-based risk measurement using the pathway dysregulation score (PDS). In combination with the L1-LASSO penalized feature selection and the COX-Proportional Hazards regression model, we have identified fifteen cancer relevant pathways using the pathway-based genomic model that successfully differentiated the relapse in the training set as well as three diversified test sets. Moreover, given the debate over whether higher-order representative features, such as GO sets, pathways and network modules, are superior to the gene-level features in the genomic models, we demonstrate that pathway-based genomic models consistently performed better than gene-based models in all four data sets. Last but not least, we show strong evidence that models that combine genomic information with clinical information improve the prognosis prediction significantly, in comparison to models that use either genomic or clinical information alone.
doi:10.1371/journal.pcbi.1003851
PMCID: PMC4168973  PMID: 25233347
17.  Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields 
Bioinformatics  2008;25(10):1307-1313.
Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.
Results: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.
Contact: ogt@cs.princeton.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn585
PMCID: PMC2677736  PMID: 19052061
18.  Group sparse canonical correlation analysis for genomic data integration 
BMC Bioinformatics  2013;14:245.
Background
The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influences on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly less than that of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome such difficulty, mostly using penalizations with l-1 norm (CCA-l1) or the combination of l-1 and l-2 norm (CCA-elastic net). However, they overlook the structural or group effect within genomic data in the analysis, which often exists and is important (e.g., SNPs spanning a gene interact and work together as a group).
Results
We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulation and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
Conclusions
The CCA-sparse group method incorporates group effects of features into the correlation analysis while performing individual feature selection simultaneously. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with true correlated features. Compared with our proposed CCA-group sparse models, CCA-l1 tends to select fewer true correlated features while CCA-group tends to select more redundant features.
doi:10.1186/1471-2105-14-245
PMCID: PMC3751310  PMID: 23937249
Group sparse CCA; Genomic data integration; Feature selection; SNP
19.  On the potential of models for location and scale for genome-wide DNA methylation data 
BMC Bioinformatics  2014;15:232.
Background
With the help of epigenome-wide association studies (EWAS), increasing knowledge on the role of epigenetic mechanisms such as DNA methylation in disease processes is obtained. In addition, EWAS aid the understanding of behavioral and environmental effects on DNA methylation. In terms of statistical analysis, specific challenges arise from the characteristics of methylation data. First, methylation β-values represent proportions with skewed and heteroscedastic distributions. Thus, traditional modeling strategies assuming a normally distributed response might not be appropriate. Second, recent evidence suggests that not only mean differences but also variability in site-specific DNA methylation associates with diseases, including cancer. The purpose of this study was to compare different modeling strategies for methylation data in terms of model performance and performance of downstream hypothesis tests. Specifically, we used the generalized additive models for location, scale and shape (GAMLSS) framework to compare beta regression with Gaussian regression on raw, binary logit and arcsine square root transformed methylation data, with and without modeling a covariate effect on the scale parameter.
Results
Using simulated and real data from a large population-based study and an independent sample of cancer patients and healthy controls, we show that beta regression does not outperform competing strategies in terms of model performance. In addition, Gaussian models for location and scale showed an improved performance as compared to models for location only. The best performance was observed for the Gaussian model on binary logit transformed β-values, referred to as M-values. Our results further suggest that models for location and scale are specifically sensitive towards violations of the distribution assumption and towards outliers in the methylation data. Therefore, a resampling procedure is proposed as a mode of inference and shown to diminish type I error rate in practically relevant settings. We apply the proposed method in an EWAS of BMI and age and reveal strong associations of age with methylation variability that are validated in an independent sample.
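For reference, the binary logit transform mentioned above maps a β-value to an M-value as M = log2(β / (1 − β)). A minimal sketch follows; the clipping constant `eps`, which keeps β-values of exactly 0 or 1 finite, is an assumption and not part of the study.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Binary logit transform of methylation beta-values (M-values)."""
    # Clip so that beta = 0 or 1 does not produce infinite M-values.
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform back to the (0, 1) proportion scale."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (1.0 + 2.0 ** m)
```

For example, β = 0.5 corresponds to M = 0 and β = 0.8 to M = 2; the transform stretches the compressed ends of the (0, 1) scale.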
Conclusions
Models for location and scale are promising tools for EWAS that may help to understand the influence of environmental factors and disease-related phenotypes on methylation variability and its role during disease development.
doi:10.1186/1471-2105-15-232
PMCID: PMC4227139  PMID: 24994026
DNA methylation; Beta regression; GAMLSS; Infinium HumanMethylation450k BeadChip; EWAS; Modeling variability; Resampling; Model performance; Model comparison; Models for location and scale
20.  High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer 
Genome Biology  2007;8(10):R215.
High resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer, and provides a genome-wide list of common copy number alterations associated with aberrant expression and poor prognosis.
Background
The characterization of copy number alteration patterns in breast cancer requires high-resolution genome-wide profiling of a large panel of tumor specimens. To date, most genome-wide array comparative genomic hybridization studies have used tumor panels of relatively large tumor size and high Nottingham Prognostic Index (NPI) that are not as representative of breast cancer demographics.
Results
We performed an oligo-array-based high-resolution analysis of copy number alterations in 171 primary breast tumors of relatively small size and low NPI, which was therefore more representative of breast cancer demographics. Hierarchical clustering over the common regions of alteration identified a novel subtype of high-grade estrogen receptor (ER)-negative breast cancer, characterized by a low genomic instability index. We were able to validate the existence of this genomic subtype in one external breast cancer cohort. Using matched array expression data we also identified the genomic regions showing the strongest coordinate expression changes ('hotspots'). We show that several of these hotspots are located in the phosphatome, kinome and chromatinome, and harbor members of the 122-breast cancer CAN-list. Furthermore, we identify frequently amplified hotspots on 8q22.3 (EDD1, WDSOF1), 8q24.11-13 (THRAP6, DCC1, SQLE, SPG8) and 11q14.1 (NDUFC2, ALG8, USP35) associated with significantly worse prognosis. Amplification of any of these regions identified 37 samples with significantly worse overall survival (hazard ratio (HR) = 2.3 (1.3-1.4) p = 0.003) and time to distant metastasis (HR = 2.6 (1.4-5.1) p = 0.004) independently of NPI.
Conclusion
We present strong evidence for the existence of a novel subtype of high-grade ER-negative tumors that is characterized by a low genomic instability index. We also provide a genome-wide list of common copy number alteration regions in breast cancer that show strong coordinate aberrant expression, and further identify novel frequently amplified regions that correlate with poor prognosis. Many of the genes associated with these regions represent likely novel oncogenes or tumor suppressors.
doi:10.1186/gb-2007-8-10-r215
PMCID: PMC2246289  PMID: 17925008
21.  Molecular characteristics of serrated adenomas of the colorectum 
Gut  2002;51(2):200-206.
Background: Serrated adenomas (SAs) of the colorectum combine architectural features of hyperplastic polyps and cytological features of classical adenomas. Molecular studies comparing SAs and classical adenomas suggest that each may be a distinct entity; in particular, it has been proposed that microsatellite instability (MSI) distinguishes SAs from classical adenomas and that SAs and the colorectal cancers arising from them develop along a pathway driven by low level microsatellite instability (MSI-L).
Aims: To define the molecular characteristics of SAs of the colorectum.
Materials and methods: We analysed 39 SAs from 27 patients, including eight SAs from patients with familial adenomatous polyposis (FAP). We screened these polyps for selected molecular changes, including loss of heterozygosity (LOH) close to APC (5q21) and CRAC1 (15q13-q22), MSI, and mutations of K-ras, APC, p53, and β-catenin. Expression patterns of β-catenin, p53, MLH1, MSH2, E-cadherin, and O6-methylguanine DNA methyltransferase (MGMT) were assessed by immunohistochemistry. Comparative genomic hybridisation was performed on several polyps.
Results: MSI was rare (<5% cases) and there was no loss of expression of mismatch repair proteins. Wnt pathway abnormalities (APC mutation/LOH, β-catenin mutation/nuclear expression) occurred in 11 SAs, including 6/31 (19%) non-FAP tumours. CRAC1 LOH occurred in 23% of tumours. K-ras mutations and p53 mutations/overexpression were found in 15% and 8% of SAs, respectively. Loss of MGMT expression occurred in 18% of polyps and showed a borderline association with K-ras mutations. Aberrant E-cadherin expression was found in seven polyps. Comparative genomic hybridisation detected no gains or deletions of chromosomal material.
Conclusions: The serrated pathway of colorectal tumorigenesis appears to be heterogeneous. In common with classical adenomas, some SAs develop along pathways involving changes in APC/β-catenin. SAs rarely show MSI or any evidence of chromosomal-scale genetic instability. K-ras mutations may however be less common in SAs than in classical adenomas. Some SAs may harbour changes in the CRAC1 gene. Changes in known genes do not account for the growth of the majority of SAs.
PMCID: PMC1773326  PMID: 12117880
serrated adenoma; microsatellite instability; loss of heterozygosity; immunohistochemistry
22.  A Dominantly Acting Murine Allele of Mcm4 Causes Chromosomal Abnormalities and Promotes Tumorigenesis 
PLoS Genetics  2012;8(11):e1003034.
Here we report the isolation of a murine model for heritable T cell lymphoblastic leukemia/lymphoma (T-ALL) called Spontaneous dominant leukemia (Sdl). Sdl heterozygous mice develop disease with a short latency and high penetrance, while mice homozygous for the mutation die early during embryonic development. Sdl mice exhibit an increase in the frequency of micronucleated reticulocytes, and T-ALLs from Sdl mice harbor small amplifications and deletions, including activating deletions at the Notch1 locus. Using exome sequencing it was determined that Sdl mice harbor a spontaneously acquired mutation in Mcm4 (Mcm4D573H). MCM4 is part of the heterohexameric complex of MCM2–7 that is important for licensing of DNA origins prior to S phase and also serves as the core of the replicative helicase that unwinds DNA at replication forks. Previous studies in murine models have discovered that genetic reductions of MCM complex levels promote tumor formation by causing genomic instability. However, Sdl mice possess normal levels of Mcms, and there is no evidence for loss-of-heterozygosity at the Mcm4 locus in Sdl leukemias. Studies in Saccharomyces cerevisiae indicate that the Sdl mutation produces a biologically inactive helicase. Together, these data support a model in which chromosomal abnormalities in Sdl mice result from the ability of MCM4D573H to incorporate into MCM complexes and render them inactive. Our studies indicate that dominantly acting alleles of MCMs can be compatible with viability but have dramatic oncogenic consequences by causing chromosomal abnormalities.
Author Summary
Our study investigated a spontaneous mouse model for dominantly inherited T-cell leukemia/lymphoma. Using genetic methods, we identified a mutant allele of Mcm4 (Mcm4D573H) in this model. Interestingly, this Mcm4 allele promotes the accumulation of focal chromosomal gains and losses, including aberrations at the Notch1 locus that drive the formation of T-cell leukemia/lymphoma. Previous studies of hypomorphic Mcm alleles have demonstrated that a decrease in MCM levels can cause tumorigenesis. However, total and chromatin bound MCM levels were similar to wild-type in our model, indicating that Mcm alleles that do not drastically impact MCM levels can cause genomic aberrations that drive tumor formation.
doi:10.1371/journal.pgen.1003034
PMCID: PMC3486839  PMID: 23133403
23.  ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples 
Bioinformatics  2014;30(17):i572-i578.
Motivation: The recent advance in high-throughput sequencing technologies is generating a huge amount of data that are becoming an important resource for deciphering the genotype underlying a given phenotype. Genome sequencing has been extensively applied to the study of the cancer genomes. Although a few methods have been already proposed for the detection of cancer-related genes, their automatic identification is still a challenging task. Using the genomic data made available by The Cancer Genome Atlas Consortium (TCGA), we propose a new prioritization approach based on the analysis of the distribution of putative deleterious variants in a large cohort of cancer samples.
Results: In this paper, we present ContrastRank, a new method for the prioritization of putative impaired genes in cancer. The method is based on the comparison of the putative defective rate of each gene in tumor versus normal and 1000 genome samples. We show that the method is able to provide a ranked list of putative impaired genes for colon, lung and prostate adenocarcinomas. The list significantly overlaps with the list of known cancer driver genes previously published. More importantly, by using our scoring approach, we can successfully discriminate between TCGA normal and tumor samples. A binary classifier based on ContrastRank score reaches an overall accuracy >90% and the area under the curve (AUC) of receiver operating characteristics (ROC) >0.95 for all the three types of adenocarcinoma analyzed in this paper. In addition, using ContrastRank score, we are able to discriminate the three tumor types with a minimum overall accuracy of 77% and AUC of 0.83.
Conclusions: We describe ContrastRank, a method for prioritizing putative impaired genes in cancer. The method is based on the comparison of exome sequencing data from different cohorts and can detect putative cancer driver genes.
ContrastRank can also be used to estimate a global score for an individual genome about the risk of adenocarcinoma based on the genetic variants information from a whole-exome VCF (Variant Call Format) file. We believe that the application of ContrastRank can be an important step in genomic medicine to enable genome-based diagnosis.
Availability and implementation: The lists of ContrastRank scores of all genes in each tumor type are available as supplementary materials. A webserver for evaluating the risk of the three studied adenocarcinomas starting from whole-exome VCF file is under development.
Contact: emidio@uab.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu466
PMCID: PMC4147919  PMID: 25161249
24.  An algorithm for classifying tumors based on genomic aberrations and selecting representative tumor models 
BMC Medical Genomics  2010;3:23.
Background
Cancer is a heterogeneous disease caused by genomic aberrations and characterized by significant variability in clinical outcomes and response to therapies. Several subtypes of common cancers have been identified based on alterations of individual cancer genes, such as HER2, EGFR, and others. However, cancer is a complex disease driven by the interaction of multiple genes, so the copy number status of individual genes is not sufficient to define cancer subtypes and predict responses to treatments. A classification based on genome-wide copy number patterns would be better suited for this purpose.
Method
To develop a more comprehensive cancer taxonomy based on genome-wide patterns of copy number abnormalities, we designed an unsupervised classification algorithm that identifies genomic subgroups of tumors. This algorithm is based on a modified genomic Non-negative Matrix Factorization (gNMF) algorithm and includes several additional components, namely a pilot hierarchical clustering procedure to determine the number of clusters, a multiple random initiation scheme, a new stop criterion for the core gNMF, as well as a 10-fold cross-validation stability test for quality assessment.
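As a rough illustration of two of the ingredients named above (non-negative matrix factorization and multiple random initiations), the following minimal sketch factorizes a small copy number matrix and assigns each tumor to its dominant factor. It is not the authors' modified gNMF: it uses the standard multiplicative updates for squared error, a fixed iteration count instead of their stop criterion, and omits the pilot hierarchical clustering and the cross-validation stability test. All function names and the toy matrix layout (rows are genomic segments, columns are tumors) are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf_once(v, k, rng, iterations=200, eps=1e-9):
    """One NMF run from a random start, using standard multiplicative updates."""
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h, squared_error(v, w, h)

def nmf_best_of(v, k, restarts=5, seed=0):
    """Run NMF from several random starts and keep the lowest-error result."""
    rng = random.Random(seed)
    return min((nmf_once(v, k, rng) for _ in range(restarts)), key=lambda r: r[2])

def assign_clusters(h):
    """Assign each column (tumor) to the factor with the largest coefficient."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]
```

Calling `nmf_best_of(v, k)` on a non-negative matrix `v` returns the factors and the reconstruction error of the best restart; `assign_clusters` then reads a hard subgroup label for each tumor from the coefficient matrix.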
Result
We applied our algorithm to identify genomic subgroups of three major cancer types: non-small cell lung carcinoma (NSCLC), colorectal cancer (CRC), and malignant melanoma. High-density SNP array datasets for patient tumors and established cell lines were used to define genomic subclasses of the diseases and identify cell lines representative of each genomic subtype. The algorithm was compared with several traditional clustering methods and showed improved performance. To validate our genomic taxonomy of NSCLC, we correlated the genomic classification with disease outcomes. Overall survival time and time to recurrence were shown to differ significantly between the genomic subtypes.
Conclusions
We developed an algorithm for cancer classification based on genome-wide patterns of copy number aberrations and demonstrated its superiority to existing clustering methods. The algorithm was applied to define genomic subgroups of three cancer types and identify cell lines representative of these subgroups. Our data enabled the assembly of representative cell line panels for testing drug candidates.
doi:10.1186/1755-8794-3-23
PMCID: PMC2901344  PMID: 20569491
25.  Genome wide association studies in presence of misclassified binary responses 
BMC Genetics  2013;14:124.
Background
Misclassification has been shown to have a high prevalence in binary responses in both livestock and human populations. Leaving these errors uncorrected before analyses will have a negative impact on the overall goal of genome-wide association studies (GWAS) including reducing predictive power. A liability threshold model that contemplates misclassification was developed to assess the effects of mis-diagnostic errors on GWAS. Four simulated scenarios of case–control datasets were generated. Each dataset consisted of 2000 individuals and was analyzed with varying odds ratios of the influential SNPs and misclassification rates of 5% and 10%.
Results
Analyses of binary responses subject to misclassification resulted in underestimation of influential SNPs and failed to estimate the true magnitude and direction of the effects. Once the misclassification algorithm was applied there was a 12% to 29% increase in accuracy, and a substantial reduction in bias. The proposed method was able to capture the majority of the most significant SNPs that were not identified in the analysis of the misclassified data. In fact, in one of the simulation scenarios, 33% of the influential SNPs were not identified using the misclassified data, compared with the analysis using the data without misclassification. However, using the proposed method, only 13% were not identified. Furthermore, the proposed method was able to identify with high probability a large portion of the truly misclassified observations.
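The underestimation of SNP effects reported above can be illustrated with a small calculation. The sketch below is not the liability threshold model of the paper. It assumes a balanced design (equal numbers of true cases and controls), a single SNP summarized by carrier frequencies, and labels that are flipped with the same probability in both groups; the function names and the example frequencies are hypothetical.

```python
def odds_ratio(p_case, p_control):
    """Odds ratio for carrying the risk allele, cases versus controls."""
    return (p_case / (1 - p_case)) / (p_control / (1 - p_control))

def observed_frequencies(p_case, p_control, flip_rate):
    """Carrier frequencies in the groups as labeled after misclassification.

    With equal group sizes and a symmetric flip rate, each labeled group is
    a mixture: a fraction (1 - flip_rate) comes from the true group and a
    fraction flip_rate from the other group.
    """
    obs_case = (1 - flip_rate) * p_case + flip_rate * p_control
    obs_control = (1 - flip_rate) * p_control + flip_rate * p_case
    return obs_case, obs_control

def attenuated_odds_ratio(p_case, p_control, flip_rate):
    """Odds ratio an analysis would see if it ignored the label errors."""
    return odds_ratio(*observed_frequencies(p_case, p_control, flip_rate))

# Hypothetical SNP: 40% carriers among true cases, 25% among true controls.
for rate in (0.0, 0.05, 0.10):
    print(rate, round(attenuated_odds_ratio(0.40, 0.25, rate), 3))
```

Under these assumptions the apparent odds ratio moves toward 1 as the flip rate grows, which is the direction of bias the abstract describes for uncorrected analyses.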
Conclusions
The proposed model provides a statistical tool to correct or at least attenuate the negative effects of misclassified binary responses in GWAS. Across different levels of misclassification probability as well as odds ratios of significant SNPs, the model proved to be robust. In fact, SNP effects, and misclassification probability were accurately estimated and the truly misclassified observations were identified with high probabilities compared to non-misclassified responses. This study was limited to situations where the misclassification probability was assumed to be the same in cases and controls which is not always the case based on real human disease data. Thus, it is of interest to evaluate the performance of the proposed model in that situation which is the current focus of our research.
doi:10.1186/1471-2156-14-124
PMCID: PMC3879434  PMID: 24369108
Misclassification; Genome wide association; Discrete responses
