Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.
Results: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.
Supplementary information: Supplementary data are available at Bioinformatics online.
Most existing methods for identifying aberrant regions with array CGH data are confined to a single target sample. Focusing on the comparison of multiple samples from two different groups, we develop a new penalized regression approach with a fused adaptive lasso penalty to accommodate the spatial dependence of the clones. The nonrandom aberrant genomic segments are determined by assessing the significance of the differences between neighboring clones and neighboring segments. The algorithm proposed in this article is a first attempt to simultaneously detect the common aberrant regions within each group, and the regions where the two groups differ in copy number changes. The simulation study suggests that the proposed procedure outperforms the commonly used single-sample aberration detection methods for segmentation in terms of both false positives and false negatives. To further assess the value of the proposed method, we analyze a data set from a study that identified the aberrant genomic regions associated with grade subgroups of breast cancer tumors.
Array CGH; Change point; Common aberration; Copy number aberration; Fused lasso; Median regression; Segmentation
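The idea of a fused lasso penalty on spatially ordered clones can be illustrated with a small objective function. This is a minimal sketch under stated assumptions, not the authors' procedure: it scores a single profile rather than a two-group regression, uses an absolute-error loss (suggested only by the "Median regression" keyword), and the names `fused_lasso_objective`, `lam1`, `lam2`, `w_sparse` and `w_fuse` are hypothetical. The optional weights stand in for the adaptive part of the penalty.

```python
def fused_lasso_objective(y, beta, lam1, lam2, w_sparse=None, w_fuse=None):
    """Absolute-error loss plus weighted sparsity and fusion penalties.

    y    : observed log-ratios, ordered by genomic position
    beta : candidate piecewise-constant signal, same length as y
    lam1 : weight of the sparsity penalty sum_j w_j * |beta_j|
    lam2 : weight of the fusion penalty sum_j v_j * |beta_j - beta_{j-1}|
    """
    n = len(beta)
    if w_sparse is None:
        w_sparse = [1.0] * n          # unweighted (non-adaptive) lasso term
    if w_fuse is None:
        w_fuse = [1.0] * (n - 1)      # unweighted (non-adaptive) fusion term
    loss = sum(abs(yi - bi) for yi, bi in zip(y, beta))
    sparsity = sum(w * abs(b) for w, b in zip(w_sparse, beta))
    fusion = sum(w * abs(beta[j] - beta[j - 1])
                 for j, w in enumerate(w_fuse, start=1))
    return loss + lam1 * sparsity + lam2 * fusion
```

The fusion term charges only for jumps between neighboring clones, so a fit with few change points is preferred over one that follows the noise clone by clone.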
DNA aberrations that cause colorectal cancer (CRC) occur in multiple steps that involve microsatellite instability (MSI) and chromosomal instability (CIN). Herein, we studied CRCs from African American (AA) patients for their CIN and MSI status.
Array CGH was performed on 30 AA colon tumors. The MSI status was established. The CGH data from AA were compared to published lists of 41 TSG and oncogenes in Caucasians and 68 cancer genes, proposed via systematic sequencing for somatic mutations in colon and breast tumors. The patient-by-patient CGH profiles were organized into a maximum parsimony cladogram to give insights into the tumors' aberrations lineage.
The CGH analysis revealed that CIN was independent of age, gender, stage or location. However, both the number and nature of aberrations seem to depend on the MSI status. MSI-H tumors clustered together in the cladogram. The chromosomes with the highest rates of CGH aberrations were 3, 5, 7, 8, 20 and X. Chromosome X was primarily amplified in male patients. A comparison with Caucasians revealed an overall similar aberration profile with few exceptions for the following genes: THRB, RAF1, LPL, DCC, XIST, PCNT, STS and genes on the 20q12-q13 cytoband. Among the 68 CAN genes, all showed some level of alteration in our cohort.
Chromosome X amplification in male patients with CRC merits follow-up. The observed CIN may play a distinctive role in CRC in AAs. The clustering of MSI-H tumors in global CGH data analysis suggests that chromosomal aberrations are not random.
Colorectal cancer (CRC) is a heterogeneous disease that, on the molecular level, can be characterized by inherent genomic instabilities: chromosome instability and microsatellite instability. In the present study we analyze genome-wide disruption of pre-mRNA splicing, and propose transcriptome instability as a characteristic that is analogous to genomic instability on the transcriptome level.
Exon microarray profiles from two independent series including a total of 160 CRCs were investigated for their relative amounts of exon usage differences. Each exon in each sample was assigned an alternative splicing score calculated by the FIRMA algorithm. Amounts of deviating exon usage per sample were derived from exons with extreme splicing scores.
There was great heterogeneity within both series in terms of sample-wise amounts of deviating exon usage. This was strongly associated with the expression levels of approximately half of 280 splicing factors (54% and 48% of splicing factors were significantly correlated to deviating exon usage amounts in the two series). Samples with high or low amounts of deviating exon usage, associated with overall transcriptome instability, were almost completely separated into their respective groups by hierarchical clustering analysis of splicing factor expression levels in both sample series. Samples showing a preferential tendency towards deviating exon skipping or inclusion were associated with skewed transcriptome instability. There were significant associations between transcriptome instability and reduced patient survival in both sample series. In the test series, patients with skewed transcriptome instability showed the strongest prognostic association (P = 0.001), while a combination of the two characteristics showed the strongest association with poor survival in the validation series (P = 0.03).
We have described transcriptome instability as a characteristic of CRC. This transcriptome instability has associations with splicing factor expression levels and poor patient survival.
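The per-sample amount of deviating exon usage described above can be sketched as a simple count of exons whose splicing score falls outside two cutoffs. This is an illustrative sketch only: the abstract does not state how "extreme" scores were defined, so the fixed cutoffs `lower` and `upper`, the function name and the toy scores are all assumptions, and the scores are taken as already computed (for example by FIRMA).

```python
def deviating_exon_counts(scores_by_sample, lower, upper):
    """Count exons with extreme splicing scores in each sample.

    scores_by_sample : dict mapping sample id -> list of per-exon scores
    lower, upper     : hypothetical cutoffs; scores below `lower` or above
                       `upper` are treated as deviating exon usage
    """
    counts = {}
    for sample, scores in scores_by_sample.items():
        low = sum(1 for s in scores if s < lower)
        high = sum(1 for s in scores if s > upper)
        counts[sample] = {"low": low, "high": high, "total": low + high}
    return counts


# Toy example with made-up scores for two samples.
example = deviating_exon_counts(
    {"tumor_a": [-3.0, 0.0, 0.5, 4.0], "tumor_b": [0.0, 0.1, -0.2, 0.3]},
    lower=-2.0,
    upper=2.0,
)
```

Keeping the low and high tails separate makes it possible to ask whether a sample leans toward one direction of deviation, which is the kind of asymmetry the abstract calls skewed transcriptome instability.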
Colorectal cancer arises as a consequence of the accumulation of genetic alterations (gene mutations, gene amplification, and so on) and epigenetic alterations (aberrant DNA methylation, chromatin modifications, and so on) that transform colonic epithelial cells into colon adenocarcinoma cells. The loss of genomic stability and resulting gene alterations are key molecular pathogenic steps that occur early in tumorigenesis; they permit the acquisition of a sufficient number of alterations in tumor suppressor genes and oncogenes that transform cells and promote tumor progression. Two predominant forms of genomic instability that have been identified in colon cancer are microsatellite instability and chromosome instability. Substantial progress has been made to identify causes of chromosomal instability in colorectal cells and to determine the effects of the different forms of genomic instability on the biological and clinical behavior of colon tumors. In addition to genomic instability, epigenetic instability results in the aberrant methylation of tumor suppressor genes. Determining the causes and roles of genomic and epigenomic instability in colon tumor formation has the potential to yield more effective prevention strategies and therapeutics for patients with colorectal cancer.
We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Different from the standard PCA which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as solving an optimization problem with a criterion function motivated from penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to efficiently solve the optimization problem. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.
Binary data; Dimension reduction; MM algorithm; LASSO; PCA; Regularization; Sparsity
Copy number aberration is a common form of genomic instability in cancer. Gene expression is closely tied to cytogenetic events by the central dogma of molecular biology, and serves as a mediator of copy number changes in disease phenotypes. Accordingly, it is of interest to develop proper statistical methods for jointly analyzing copy number and gene expression data. This work describes a novel Bayesian inferential approach for a double-layered mixture model (DLMM) which directly models the stochastic nature of copy number data and identifies abnormally expressed genes due to aberrant copy number. Simulation studies were conducted to illustrate the robustness of DLMM under various settings of copy number aberration frequency, confounding effects, and signal-to-noise ratio in gene expression data. Analysis of a real breast cancer data set shows that DLMM is able to identify expression changes specifically attributable to copy number aberration in tumors and that a sample-specific index built based on the selected genes is correlated with relevant clinical information.
cancer genomics; statistics
Hoffman et al. proposed an elegant resampling method for analyzing clustered binary data. The focus of their paper was to perform association tests on clustered binary data using the within-cluster resampling (WCR) method. Follmann et al. extended Hoffman et al.’s procedure more generally, with applicability to angular data, combining of p-values, testing of vectors of parameters, and Bayesian inference. Follmann et al. termed their procedure multiple outputation because all “excess” data within each cluster is thrown out multiple times. Herein, we refer to this procedure as WCR-MO. For any statistical test to be useful for a particular design, it must be robust, have adequate power, and be easy to implement and flexible. WCR-MO can be easily extended to continuous data and is a computationally intensive but simple and highly flexible method. Considering family as a cluster, one can apply WCR to familial data in genetic studies. Using simulations, we evaluated WCR-MO’s robustness for analysis of a continuous trait in terms of type I error rates in genetic research. WCR-MO performed well at the 5% α-level. However, it provided inflated type I error rates for α-levels less than 5%, implying that the procedure is liberal and may not be ready for application to genetic studies, where the α-levels used are typically much less than 0.05.
Correlated residuals; WCR; Multiple Outputation; familial data; genetic research; Type I error
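The resampling idea behind WCR-MO, keeping one observation per cluster and discarding the rest, then repeating and averaging, can be sketched as follows. This is a minimal illustration under assumptions: the function name `wcr_estimate`, the default number of resamples and the fixed seed are hypothetical, and only the point estimate is shown (the variance calculation used for testing is omitted).

```python
import random
from statistics import mean


def wcr_estimate(clusters, estimator, n_resamples=1000, seed=0):
    """Average an estimator over repeated one-per-cluster resamples.

    clusters  : list of non-empty lists, one list of observations per cluster
                (for familial data, one list per family)
    estimator : function applied to each resampled data set of independent
                observations
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Keep exactly one randomly chosen observation from every cluster,
        # so the resampled data set contains no within-cluster correlation.
        resample = [rng.choice(cluster) for cluster in clusters]
        estimates.append(estimator(resample))
    return mean(estimates), estimates


# Toy example: three "families" with identical trait values within family.
families = [[1.0, 1.0], [2.0], [3.0, 3.0, 3.0]]
point_estimate, all_estimates = wcr_estimate(families, mean, n_resamples=50)
```

Because each resample contains independent observations, any standard analysis for independent data can be plugged in as `estimator`, which is what makes the approach flexible but computationally intensive.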
Array-based comparative genomic hybridization (aCGH) is a high-resolution high-throughput technique for studying the genetic basis of cancer. The resulting data consists of log fluorescence ratios as a function of the genomic DNA location and provides a cytogenetic representation of the relative DNA copy number variation. Analysis of such data typically involves estimation of the underlying copy number state at each location and segmenting regions of DNA with similar copy number states. Most current methods proceed by modeling a single sample/array at a time, and thus fail to borrow strength across multiple samples to infer shared regions of copy number aberrations. We propose a hierarchical Bayesian random segmentation approach for modeling aCGH data that utilizes information across arrays from a common population to yield segments of shared copy number changes. These changes characterize the underlying population and allow us to compare different population aCGH profiles to assess which regions of the genome have differential alterations. Our method, referred to as BDSAcgh (Bayesian Detection of Shared Aberrations in aCGH), is based on a unified Bayesian hierarchical model that allows us to obtain probabilities of alteration states as well as probabilities of differential alteration that correspond to local false discovery rates. We evaluate the operating characteristics of our method via simulations and an application using a lung cancer aCGH data set.
Bayesian methods; Comparative Genomic Hybridization; Copy number; Functional data analysis; Mixed Models; Mixture Models
DNA damage at the base-sequence, epigenome and chromosome level is a fundamental cause of developmental and degenerative diseases. Multiple micronutrients and their interactions with the inherited and/or acquired genome determine DNA damage and genomic instability rates. The challenge is to identify for each individual the combination of micronutrients and their doses (i.e. the nutriome) that optimises genome stability and DNA repair. In this paper I describe and propose the use of high-throughput nutrient array systems with high content analysis diagnostics of DNA damage, cell death and cell growth for defining, on an individual basis, the optimal nutriome for DNA damage prevention and cancer growth control.
Previous research has compared methods of estimation for multilevel models fit to binary data but there are reasons to believe that the results will not always generalize to the ordinal case. This paper thus evaluates (a) whether and when fitting multilevel linear models to ordinal outcome data is justified and (b) which estimator to employ when instead fitting multilevel cumulative logit models to ordinal data, Maximum Likelihood (ML) or Penalized Quasi-Likelihood (PQL). ML and PQL are compared across variations in sample size, magnitude of variance components, number of outcome categories, and distribution shape. Fitting a multilevel linear model to ordinal outcomes is shown to be inferior in virtually all circumstances. PQL performance improves markedly with the number of ordinal categories, regardless of distribution shape. In contrast to binary data, PQL often performs as well as ML when used with ordinal data. Further, the performance of PQL is typically superior to ML when the data includes a small to moderate number of clusters (i.e., ≤ 50 clusters).
Multilevel Models; Random Effects; Ordinal; Categorical; Cumulative Logit Model; Proportional Odds Model
In breast cancer, the basal-like subtype has high levels of genomic instability relative to other breast cancer subtypes with many basal-like-specific regions of aberration. There is evidence that this genomic instability extends to smaller scale genomic aberrations, as shown by a previously described micro-deletion event in the PTEN gene in the Basal-like SUM149 breast cancer cell line.
We sought to determine whether small regions of genomic DNA copy number change exist by using a high-density, gene-centric Comparative Genomic Hybridization (CGH) array on cell lines and primary tumors. A custom tiling array for CGH (244,000 probes, 200 bp tiling resolution), focused on previously identified basal-like-specific and general cancer genes, was created to identify small regions of genomic change. Tumor genomic DNA from 94 patients and 2 breast cancer cell lines was labeled and hybridized to these arrays. Aberrations were called using SWITCHdna and the smallest 25% of SWITCHdna-defined genomic segments were called micro-aberrations (<64 contiguous probes, ∼ 15 kb).
Our data showed that primary tumor breast cancer genomes frequently contained many small-scale copy number gains and losses, termed micro-aberrations, most of which are undetectable using typical-density genome-wide aCGH arrays. The basal-like subtype exhibited the highest incidence of these events. These micro-aberrations sometimes altered expression of the involved gene. We confirmed the presence of the PTEN micro-amplification in SUM149 and by mRNA-seq showed that this resulted in loss of expression of all exons downstream of this event. Micro-aberrations disproportionately affected the 5′ regions of the affected genes, including the promoter region, and high frequency of micro-aberrations was associated with poor survival.
Using a high-probe-density, gene-centric aCGH microarray, we present evidence of small-scale genomic aberrations that can contribute to gene inactivation. These events may contribute to tumor formation through mechanisms not detected using conventional DNA copy number analyses.
High-risk HPV E6 and E7 oncoproteins cooperate to subvert critical host cell cycle checkpoint control mechanisms in order to promote viral genome replication. This results not only in aberrant proliferation but also in host cellular changes that can promote genomic instability. The HPV-16 E7 oncoprotein was found to induce centrosome abnormalities, thereby disrupting mitotic fidelity and increasing the risk for chromosome missegregation and aneuploidy. In addition, expression of the high-risk HPV E7 oncoprotein stimulates DNA replication stress as a potential source of DNA breakage and structural chromosomal instability. Proliferation of genomically unstable cells is sustained by several mechanisms including the accelerated degradation of claspin by HPV-16 E7 and the degradation of p53 by the high-risk HPV E6 oncoprotein. These results highlight the oncogenic potential of aberrant proliferation and open new avenues for prevention of malignant progression, not only in HPV-associated cervical cancer but also in non-virally associated malignancies with disrupted cell cycle checkpoint control mechanisms.
There is an emerging consensus that secondary structures of DNA have the potential for genomic instability. Palindromic AT-rich repeats (PATRRs) are a characteristic sequence identified at each breakpoint of the recurrent constitutional t(11;22) and t(17;22) translocations in humans, named PATRR22 (∼600 bp), PATRR11 (∼450 bp) and PATRR17 (∼190 bp). The secondary structure-forming propensity in vitro and the instability in vivo have been experimentally evaluated for various PATRRs that differ regarding their size and symmetry. At physiological ionic strength, a cruciform structure is most frequently observed for the symmetric PATRR22, less often for the symmetric PATRR11, but not for the other PATRRs. In wild-type E. coli, only these two PATRRs undergo extensive instability, consistent with the relatively high incidence of the t(11;22) in humans. The resultant deletions are putatively mediated by central cleavage by the structure-specific endonuclease SbcCD, indicating the possibility of a cruciform conformation in vivo. Insertion of a short spacer at the centre of the PATRR22 greatly reduces both its cruciform extrusion in vitro and instability in vivo. Taken together, cruciform extrusion propensity depends on the length and central symmetry of the PATRR, and is likely to determine the instability that leads to recurrent translocations in humans.
High-grade osteosarcoma is an aggressive tumor most commonly affecting adolescents. The early age of onset might suggest genetic predisposition; however, the vast majority of the tumors are sporadic. Early onset, the frequent lack of a predisposing condition or lesion, the low (<2%) prevalence of inheritance, extensive genomic instability, and wide histological heterogeneity are just a few of the factors that make osteosarcoma difficult to study. Therefore, it is sensible to design and use models representative of the human disease. Here we summarize multiple osteosarcoma models established in vitro and in vivo, comment on their utilities, and highlight the newest achievements, such as the use of zebrafish embryos. We conclude that to gain a better understanding of osteosarcoma, simplification of this extremely complex tumor is needed. Therefore, we parse the osteosarcoma problem into parts and propose adequate models to study each of them separately. A better understanding of osteosarcoma provides opportunities for discovering and assaying novel effective treatment strategies.
“Sometimes the model is more interesting than the original disease”
PJ Hoedemaeker (1937–2007).
In multi-locus association analysis, since some markers may not be associated with a trait, it seems attractive to use penalized regression with the capability of automatic variable selection. On the other hand, in spite of a rapidly growing body of literature on penalized regression, most of it focuses on variable selection and outcome prediction, for which penalized methods are generally more effective than their non-penalized counterparts. However, for statistical inference, i.e. hypothesis testing and interval estimation, it is less clear how penalized methods would perform, or even how to best apply them, largely due to lack of studies on this topic. In our motivating data for a cohort of kidney transplant recipients, it is of primary interest to assess whether a group of genetic variants is associated with a binary clinical outcome, acute rejection at 6 months. In this paper, we study some technical issues and alternative implementations of hypothesis testing in Lasso penalized logistic regression, and compare their performance with each other and with several existing global tests, some of which are specifically designed as variance component tests for high-dimensional data. The most interesting, and perhaps surprising, conclusion of this study is that, for low to moderately high-dimensional data, statistical tests based on Lasso penalized regression are not necessarily more powerful than some existing global tests. In addition, in penalized regression, rather than building a test based on a single selected “best” model, combining multiple tests, each of which is built on a candidate model, might be more promising.
Lasso; Logistic kernel machine regression; Logistic regression; Random-effects model; Score test; Sum of squared score (SSU) test
Tumoral tissues generally exhibit aberrations in DNA copy number that are associated with the development and progression of cancer. Genotyping methods such as array-based comparative genomic hybridization (aCGH) provide means to identify copy number variation across the entire genome. To address some of the shortfalls of existing methods of DNA copy number data analysis, including strong model assumptions, lack of accounting for sampling variability of estimators, and the assumption that clones are independent, we propose a simple graphical approach to assess population-level genetic alterations over the entire genome based on a moving average. Furthermore, existing methods primarily focus on segmentation and do not examine the association of covariates with genetic instability. In our methods, covariates are incorporated through a possibly mis-specified working model and sampling variabilities of estimators are approximated using a resampling method that is based on perturbing observed processes. Our proposal, which is applicable to partial, entire or multiple chromosomes, is illustrated through application to aCGH studies of two brain tumor types, meningioma and glioma.
aCGH data; Moving average; Perturbation method; Gaussian process; Genomic data
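The moving-average smoothing that underlies the graphical approach can be sketched in a few lines. This is only an illustration of a centered moving average over clones ordered by genomic position; the window handling at the chromosome ends, the function name and the toy values are assumptions, and the covariate model and perturbation-based resampling from the abstract are not reproduced here.

```python
def moving_average(values, window):
    """Centered moving average over values ordered by genomic position.

    Near the ends, the window is truncated to the clones that exist
    (an assumption made here for simplicity).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Toy example: a single-clone spike is spread over its neighbors.
smoothed_profile = moving_average([0, 0, 3, 0, 0], window=3)
```

In a population-level analysis the input would typically be a per-clone summary across samples, so that the smoothed curve reflects alterations shared by the group rather than noise in one array.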
Phylogenetic profiles express the presence or absence of genes and their homologs across a number of reference genomes. They have emerged as an elegant representation framework for comparative genomics and have been used for the genome-wide inference and discovery of functionally linked genes or metabolic pathways. As the number of reference genomes grows, there is an acute need for faster and more accurate methods for phylogenetic profile analysis. We propose a novel, efficient method for the detection of genomic idiosyncrasies, i.e. sets of genes found in a specific genome with peculiar phylogenetic properties, such as intra-genome correlations or inter-genome relationships. Our algorithm is a four-step process where genome profiles are first defined as fuzzy vectors, then discretized to binary vectors, followed by a de-noising step, and finally a comparison step to generate intra- and inter-genome distances for each gene profile. The method is validated with a carefully selected benchmark set of five reference genomes, using a range of approaches regarding similarity metrics and pre-processing stages for noise reduction. We demonstrate that the fuzzy profile method consistently identifies the actual phylogenetic relationship and origin of the genes under consideration for the majority of the cases, while the detected outliers are found to be particular genes with peculiar phylogenetic patterns. The proposed method provides a time-efficient and highly scalable approach for phylogenetic stratification, with the detected groups of genes being either similar to their own genome profile or different from it, thus revealing atypical evolutionary histories.
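Part of the pipeline described above (fuzzy vectors, discretization, comparison) can be sketched with plain lists. This is a loose illustration, not the published algorithm: the 0.5 threshold, the majority-vote genome profile, the Hamming distance and all function names are assumptions, and the de-noising step is left out because the abstract does not say how it works.

```python
def discretize(fuzzy_profile, threshold=0.5):
    """Turn a fuzzy presence vector (values in [0, 1]) into a binary vector."""
    return [1 if v >= threshold else 0 for v in fuzzy_profile]


def hamming(a, b):
    """Number of reference genomes on which two binary profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def consensus(profiles):
    """Majority vote per reference genome (ties count as present)."""
    n = len(profiles)
    return [1 if 2 * sum(column) >= n else 0 for column in zip(*profiles)]


def distances_to_genome_profile(fuzzy_profiles, threshold=0.5):
    """Distance of every gene profile to the consensus profile of its genome."""
    binary = {gene: discretize(p, threshold) for gene, p in fuzzy_profiles.items()}
    genome_profile = consensus(list(binary.values()))
    return {gene: hamming(b, genome_profile) for gene, b in binary.items()}


# Toy example: gene g3 has a pattern unlike the rest of its genome.
toy = distances_to_genome_profile(
    {"g1": [0.9, 0.8, 0.1], "g2": [0.7, 0.9, 0.2], "g3": [0.1, 0.2, 0.95]}
)
```

Genes with a large distance to their own genome profile are the candidates for atypical evolutionary histories in this simplified picture.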
Suggestions that the induction of genomic instability could play a role in radiation-induced carcinogenesis and heritable disease prompted the investigation of chromosome instability in relation to radiotherapy for childhood cancer. Chromosome analysis of peripheral blood lymphocytes at their first in vitro division was undertaken on 25 adult survivors of childhood cancer treated with radiation, 26 partners who acted as the non-irradiated control group and 43 offspring. A statistically significant increase in the frequency of dicentrics in the cancer survivor group compared with the partner control group was attributed to the residual effect of past radiation therapy. However, chromatid aberrations plus chromosome gaps, the aberrations most associated with persistent instability, were not increased. Therefore, there was no evidence that irradiation of the bone marrow had resulted in instability being transmitted to descendant cells. Frequencies of all aberration categories were significantly lower in the offspring group, compared to the partner group, apart from dicentrics for which the decrease did not reach statistical significance. The lower frequencies in the offspring provide no indication of transmissible instability being passed through the germline to the somatic cells of the offspring. Thus, in this study, genomic instability was not associated with radiotherapy in those who had received such treatment, nor was it found to be a transgenerational radiation effect.
Chromosome aberrations; Genomic instability; Radiotherapy; Carcinogenesis
Bloom's syndrome (BS) is a genetic disorder characterized cellularly by increases in sister chromatid exchanges (SCEs) and numbers of micronuclei. BS is caused by mutation in the BLM DNA helicase gene and involves a greatly enhanced risk of developing the range of malignancies seen in the general population. With a mouse model for the disease, we set out to determine the relationship between genomic instability and neoplasia. We used a novel two-step analysis to investigate a panel of eight cell lines developed from mammary tumors that appeared in Blm conditional knockout mice. First, the panel of cell lines was examined for instability. High numbers of SCEs were uniformly seen in members of the panel, and several lines produced chromosomal instability (CIN) manifested by high numbers of chromosomal structural aberrations (CAs) and chromosome missegregation events. Second, to see if Blm mutation was responsible for the CIN, time-dependent analysis was conducted on a tumor line harboring a functional floxed Blm allele. The floxed allele was deleted in vitro, and mutant as well as control subclones were cultured for 100 passages. By passage 100, six of nine mutant subclones had acquired high CIN. Nine mutant subclones produced 50-fold more CAs than did nine control subclones. Finally, chromosome loss preceded the appearance of CIN, suggesting that this loss provides a potential mechanism for the induction of instability in mutant subclones. Such aneuploidy or CIN is a universal feature of neoplasia but has an uncertain function in oncogenesis. Our results show that Blm gene mutation produces this instability, strengthening a role for CIN in the development of human cancer.
A main goal in understanding cell mechanisms is to explain the relationship among genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. There exist different kinds of microarray platforms that produce different types of binary data (images and raw data). Moreover, even for a single vendor, different chips are available. The analysis of microarray data requires an initial preprocessing phase (i.e. normalization and summarization) of raw data that makes them suitable for use on existing platforms, such as the TIGR M4 Suite. Nevertheless, the annotation of data with additional information, such as gene function, is needed to perform more powerful analysis. Raw data preprocessing and annotation are often performed in a manual and error prone way. Moreover, many available preprocessing tools do not support annotation. Thus novel, platform independent, and possibly open source tools enabling the semi-automatic preprocessing and annotation of microarray data are needed.
The paper presents μ-CS (Microarray Cel file Summarizer), a cross-platform tool for the automatic normalization, summarization and annotation of Affymetrix binary data. μ-CS is based on a client-server architecture. The μ-CS client is provided both as a plug-in of the TIGR M4 platform and as a Java standalone tool and enables users to read, preprocess and analyse binary microarray data, avoiding the manual invocation of external tools (e.g. the Affymetrix Power Tools), the manual loading of preprocessing libraries, and the management of intermediate files. The μ-CS server automatically updates the references to the summarization and annotation libraries that are provided to the μ-CS client before the preprocessing. The μ-CS server is based on the web services technology and can be easily extended to support more microarray vendors (e.g. Illumina).
Thus μ-CS users can directly manage binary data without worrying about locating and invoking the proper preprocessing tools and chip-specific libraries. Moreover, users of the μ-CS plugin for TM4 can manage Affymetrix binary files without using external tools, such as APT (Affymetrix Power Tools) and related libraries. Consequently, μ-CS offers four main advantages: (i) it avoids wasting time searching for the correct libraries, (ii) it reduces possible errors in the preprocessing and further analysis phases, e.g. due to the incorrect choice of parameters or the use of old libraries, (iii) it implements the annotation of preprocessed data, and finally, (iv) it may enhance the quality of further analysis since it provides the most updated annotation libraries. The μ-CS client is freely available as a plugin of the TM4 platform as well as a standalone application at the project web site (http://bioingegneria.unicz.it/M-CS).
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as the K nearest neighbor (KNN) can naturally extend to multiclass problems and usually perform well with unbalanced data, but suffer from the curse of dimensionality. Their performance is degraded when applied to high dimensional data. On the other hand, model-based methods such as logistic regression require the decomposition of the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high dimensional data with L1 or Lp penalized methods, such approaches can only select independent features and the features selected with different binary problems are usually different. They also produce unbalanced classification problems with the one-vs.-rest scheme even if the original multiclass problem is balanced.
By combining instance-based and model-based learning, we propose an efficient learning method with integrated KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. Our proposed method simultaneously minimizes the intra-class distance and maximizes the interclass distance with fewer estimated parameters. It is very efficient for problems with small sample size and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection methods can identify highly correlated features simultaneously, avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated with simulation and real data including one unbalanced microRNA dataset for leukemia and one multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well in limited computational experiments.
feature selection; multiclass classification; statistical learning; high-dimensional data
Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs.
In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). Because wavelet analysis is a multi-resolution approach, it makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain the known oncogenes SMAD7 (chr18q21.1) and FGF10 (chr5p12), although these regions are not reported by GISTIC, a recent CNA detection algorithm.
Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets.
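To illustrate why a multi-resolution view helps separate focal from broad aberrations, the sketch below compares Haar approximations of a copy number profile at a fine and a coarse scale. It is an illustrative toy under stated assumptions, not the WIFA algorithm: the function names, the block-mean form of the Haar approximation, the synthetic profile and the threshold of 1.0 are all hypothetical.

```python
def haar_approximation(signal, level):
    """Haar approximation at the given level: mean over blocks of 2**level probes."""
    block = 2 ** level
    out = []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out


def focal_component(signal, fine_level, coarse_level):
    """Fine-scale approximation minus coarse-scale approximation.

    Broad events appear at both scales and largely cancel;
    focal events survive in the difference.
    """
    fine = haar_approximation(signal, fine_level)
    coarse = haar_approximation(signal, coarse_level)
    return [f - c for f, c in zip(fine, coarse)]


# Hypothetical profile: a broad gain over probes 0-31 and a
# focal amplification on top of it at probes 8-11.
signal = [0.5] * 32 + [0.0] * 32
for i in range(8, 12):
    signal[i] += 2.0

focal = focal_component(signal, fine_level=2, coarse_level=5)
focal_idx = [i for i, v in enumerate(focal) if v > 1.0]
print(focal_idx)
```

In this toy profile only the four probes carrying the focal amplification exceed the threshold, while the surrounding broad gain is removed by subtracting the coarse approximation.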
Motivation: Components of biological systems interact with each other in order to carry out vital cell functions. Such information can be used to improve estimation and inference, and to obtain better insights into the underlying cellular mechanisms. Discovering regulatory interactions among genes is therefore an important problem in systems biology. Whole-genome expression data over time provides an opportunity to determine how the expression levels of genes are affected by changes in transcription levels of other genes, and can therefore be used to discover regulatory interactions among genes.
Results: In this article, we propose a novel penalization method, called truncating lasso, for estimation of causal relationships from time-course gene expression data. The proposed penalty can correctly determine the order of the underlying time series, and improves the performance of the lasso-type estimators. Moreover, the resulting estimate provides information on the time lag between activation of transcription factors and their effects on regulated genes. We provide an efficient algorithm for estimation of model parameters, and show that the proposed method can consistently discover causal relationships in the large p, small n setting. The performance of the proposed model is evaluated favorably in simulated, as well as real, data examples.
Availability: The proposed truncating lasso method is implemented in the R-package ‘grangerTlasso’ and is freely available at http://www.stat.lsa.umich.edu/~shojaie/
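The general idea of penalized estimation from lagged time-course data can be sketched as follows. This example uses a plain lasso fitted by coordinate descent, not the truncating lasso penalty and not the grangerTlasso package; the function names, the two simulated series, the maximum lag of 3 and the penalty value 0.05 are assumptions made for illustration, and the series are not centered because they are simulated with mean zero.

```python
import random


def soft_threshold(value, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lagged_design(series, target, max_lag):
    """Rows are time points t >= max_lag; columns are every series at lags 1..max_lag."""
    T = len(series[0])
    X, y = [], []
    for t in range(max_lag, T):
        X.append([s[t - lag] for s in series for lag in range(1, max_lag + 1)])
        y.append(series[target][t])
    return X, y


# Hypothetical data: gene y is driven by gene x with a time lag of one step.
rng = random.Random(0)
T = 200
x = [rng.gauss(0.0, 1.0) for _ in range(T)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0.0, 0.05) for t in range(1, T)]

X, target = lagged_design([x, y], target=1, max_lag=3)
beta = lasso_cd(X, target, lam=0.05)
names = [f"{g} lag {lag}" for g in ("x", "y") for lag in (1, 2, 3)]
strongest = max(range(len(beta)), key=lambda j: abs(beta[j]))
print(names[strongest])
```

The nonzero coefficients indicate which lagged predictors are retained; in this simulated setting the dominant coefficient is the one for x at lag 1, which matches how the data were generated.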
Motivation: Genomic instability is one of the fundamental factors in tumorigenesis and tumor progression. Many studies have shown that copy-number abnormalities at the DNA level are important in the pathogenesis of cancer. Array comparative genomic hybridization (aCGH), developed based on expression microarray technology, can reveal the chromosomal aberrations in segmental copies at a high resolution. However, due to the nature of aCGH, many standard expression data processing tools, such as data normalization, often fail to yield satisfactory results.
Results: We demonstrate a novel aCGH normalization algorithm, which provides accurate aCGH data normalization by utilizing the dependency of neighboring probe measurements in aCGH experiments. To facilitate the study, we have developed a hidden Markov model (HMM) to simulate a series of aCGH experiments with random DNA copy number alterations that are used to validate the performance of our normalization. In addition, we applied the proposed normalization algorithm to an aCGH study of lung cancer cell lines. By using the proposed algorithm, data quality and the reliability of experimental results are significantly improved, and distinct patterns of DNA copy number alterations are observed among those lung cancer cell lines.
Supplementary information: Source code and figures may be found at http://ntumaps.cgm.ntu.edu.tw/aCGH_supplementary
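A simulation of this kind can be sketched with a small three-state Markov chain over copy number states that emits noisy log2 ratios. This is a minimal illustration, not the authors' simulator: the state set, the probability of staying in a state, the noise level and the function name `simulate_acgh` are hypothetical. The emission means are the ideal log2 ratios for a single-copy loss (log2(1/2) = -1) and a single-copy gain (log2(3/2), about 0.58) in a diploid genome.

```python
import random

STATES = ["loss", "neutral", "gain"]
# Ideal log2 ratios for one-copy loss, no change, and one-copy gain.
MEAN_LOG2 = {"loss": -1.0, "neutral": 0.0, "gain": 0.58}


def simulate_acgh(n_probes, stay_prob=0.98, noise_sd=0.15, seed=0):
    """Simulate hidden copy number states and noisy log2 ratios along a chromosome.

    Neighboring probes tend to share a state because the chain
    stays in its current state with probability stay_prob.
    """
    rng = random.Random(seed)
    state = "neutral"
    states, ratios = [], []
    for _ in range(n_probes):
        if rng.random() > stay_prob:
            # Jump to one of the other two states with equal probability.
            state = rng.choice([s for s in STATES if s != state])
        states.append(state)
        ratios.append(rng.gauss(MEAN_LOG2[state], noise_sd))
    return states, ratios


states, ratios = simulate_acgh(500)
print(states[:5], [round(r, 2) for r in ratios[:5]])
```

Because the true states are known, the simulated profiles can serve as ground truth when checking whether a normalization or segmentation step recovers the altered segments.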