With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results than traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers than several existing pathway activity inference methods.
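The log-likelihood-ratio idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each gene's expression is assumed to be Gaussian within each phenotype, the per-gene ratios are combined by a plain sum, and the names `fit_gene_models` and `pathway_activity` are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_gene_models(expr_by_class):
    """Estimate (mean, std) per gene and phenotype.

    expr_by_class: {phenotype: {gene: [expression values]}}
    Assumption: a Gaussian class-conditional density for every gene.
    """
    return {
        label: {gene: (mean(vals), stdev(vals)) for gene, vals in genes.items()}
        for label, genes in expr_by_class.items()
    }


def pathway_activity(sample, pathway_genes, models, pos=1, neg=0):
    """Sum of per-gene log-likelihood ratios log f_pos(x) - log f_neg(x).

    Assumption: the ratios are combined by simple addition.
    """
    total = 0.0
    for gene in pathway_genes:
        mu1, s1 = models[pos][gene]
        mu0, s0 = models[neg][gene]
        x = sample[gene]
        total += gaussian_logpdf(x, mu1, s1) - gaussian_logpdf(x, mu0, s0)
    return total


# Toy training data (made up for illustration only).
training = {
    1: {"g1": [2.0, 2.2, 1.8], "g2": [1.0, 1.1, 0.9]},
    0: {"g1": [0.0, 0.2, -0.2], "g2": [0.0, 0.1, -0.1]},
}
models = fit_gene_models(training)
activity_pos = pathway_activity({"g1": 1.9, "g2": 1.0}, ["g1", "g2"], models)
activity_neg = pathway_activity({"g1": 0.1, "g2": 0.0}, ["g1", "g2"], models)
print(activity_pos, activity_neg)
```

A positive activity favours phenotype 1 and a negative activity favours phenotype 0; the resulting per-pathway values could then be passed to any standard classifier.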
The versatility of DNA copy number amplifications for profiling and categorization of various tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement technique makes it possible to explore sets of cancerous tissues to identify novel subtypes. The statistical approaches previously used for such analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses and hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models.
While purely algorithmic methods are usually easily applicable, their suboptimal performance and limitations in making formal inference have been thoroughly discussed in the statistical literature. Here we introduce a Bayesian model-based approach to simultaneous identification of underlying tissue groups and the informative amplifications. The model-based approach provides the possibility of using formal inference to determine the number of groups from the data, in contrast to the ad hoc methods often exploited for similar purposes. The model also automatically recognizes the chromosomal areas that are relevant for the clustering.
Validatory analyses of simulated data and a large database of DNA copy number amplifications in human neoplasms are used to illustrate the potential of our approach. Our software implementation BASTA for performing Bayesian statistical tissue profiling is freely available for academic purposes at
Under-age drinking is a long-standing public health problem in the USA, and identifying under-age drinkers suffering alcohol-related problems has been difficult using diagnostic criteria that were developed in adult populations. For this reason, it is important to characterize patterns of drinking in adolescents that are associated with alcohol-related problems. Latent class analysis is a statistical technique for explaining heterogeneity in individual response patterns in terms of a smaller number of classes. However, the latent class analysis assumption of local independence may not be appropriate when examining behavioural profiles and could have implications for statistical inference. In addition, if covariates are included in the model, non-differential measurement is also assumed. We propose a flexible set of models for local dependence and differential measurement that use easily interpretable odds ratio parameterizations while simultaneously fitting a marginal regression model for the latent class prevalences. Estimation is based on solving a set of second-order estimating equations. This approach requires only specification of the first two moments and allows for the choice of simple ‘working’ covariance structures. The method is illustrated by using data from a large-scale survey of under-age drinking. This new approach indicates the effectiveness of introducing local dependence and differential measurement into latent class models for selecting substantively interpretable models over more complex models that are deemed empirically superior.
Differential measurement; Latent class; Local dependence; Marginal regression; Odds ratio; Second-order estimating equations
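For reference, the standard latent class model that the abstract above relaxes can be sketched in a few lines. Under local independence, the probability of a binary response pattern y is P(y) = sum over classes c of pi_c * prod over items j of p_cj^y_j (1 - p_cj)^(1 - y_j). The sketch below evaluates that expression and the posterior class membership; the function name and the toy parameter values are hypothetical, and the local dependence and differential measurement extensions proposed in the abstract are not implemented.

```python
def latent_class_posterior(y, class_probs, item_probs):
    """Marginal probability of a binary pattern and posterior class membership.

    y           : list of 0/1 item responses
    class_probs : class prevalences pi_c (should sum to 1)
    item_probs  : item_probs[c][j] = P(item j endorsed | class c)
    Assumes local independence: items are independent given the class.
    """
    joint = []
    for pi_c, p_c in zip(class_probs, item_probs):
        lik = pi_c
        for y_j, p_cj in zip(y, p_c):
            lik *= p_cj if y_j == 1 else (1.0 - p_cj)
        joint.append(lik)
    marginal = sum(joint)
    posterior = [value / marginal for value in joint]
    return marginal, posterior


# Toy two-class, three-item example (values made up for illustration).
class_probs = [0.7, 0.3]
item_probs = [
    [0.1, 0.2, 0.1],  # low-endorsement class
    [0.8, 0.9, 0.7],  # high-endorsement class
]
marginal, posterior = latent_class_posterior([1, 1, 1], class_probs, item_probs)
print(marginal, posterior)
```

In this toy example a respondent endorsing all three items is assigned almost entirely to the high-endorsement class.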
Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on the estimated probabilities. In contrast, hard classifiers directly target the classification decision boundary without producing probability estimates. These two types of classifiers are based on different philosophies and each has its own merits. In this paper, we propose a novel family of large-margin classifiers, namely large-margin unified machines (LUMs), which covers a broad range of margin-based classifiers including both hard and soft ones. By offering a natural bridge from soft to hard classification, the LUM provides a unified algorithm to fit various classifiers and hence a convenient platform to compare hard and soft classification. Both theoretical consistency and numerical performance of LUMs are explored. Our numerical study sheds some light on the choice between hard and soft classifiers in various classification problems.
Class Probability Estimation; DWD; Fisher Consistency; Regularization; SVM
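The bridge from soft to hard classification can be illustrated with a piecewise margin loss. The abstract does not state the loss, so the form below, with shape parameters `a` and `c`, is an assumption based on how the LUM loss is commonly written and should be checked against the paper: V(u) = 1 - u for u < c/(1+c), and V(u) = (1/(1+c)) * (a / ((1+c)u - c + a))^a otherwise. As c grows the loss approaches the SVM hinge loss (hard classification); small c gives a smoother, DWD-like loss (soft classification).

```python
def lum_loss(u, a=1.0, c=1.0):
    """Assumed LUM loss at functional margin u = y * f(x).

    a > 0 and c >= 0 are shape parameters; the form is an assumption,
    not taken from the abstract.
    """
    threshold = c / (1.0 + c)
    if u < threshold:
        return 1.0 - u
    return (1.0 / (1.0 + c)) * (a / ((1.0 + c) * u - c + a)) ** a


def hinge_loss(u):
    """Standard SVM hinge loss, shown for comparison."""
    return max(0.0, 1.0 - u)


# Compare the soft end (c = 0), an intermediate case, and the hard end (large c).
for margin in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(margin, lum_loss(margin, c=0.0), lum_loss(margin, c=1.0),
          lum_loss(margin, c=1e6), hinge_loss(margin))
```

The two branches meet at u = c/(1+c), where both equal 1/(1+c), so the loss is continuous for any choice of the parameters.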
Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
Factor analysis; Latent variables; Mixture model; Model-based clustering; Nested Dirichlet process; Order restriction; Random probability measure; Stick breaking
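The idea of an unbounded number of latent classes, of which only finitely many appear in a sample, is often illustrated with the stick-breaking construction listed in the keywords. The sketch below draws truncated stick-breaking weights for an ordinary Dirichlet process with concentration parameter `alpha`. It is a generic illustration only: the predictor-dependent allocation, the nested structure, and the stochastic ordering constraint of the proposed model are not represented, and the function name is hypothetical.

```python
import random


def stick_breaking_weights(alpha, n_sticks, rng):
    """Draw the first n_sticks class weights of a Dirichlet process.

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick.
    Returns the weights and the leftover mass not yet assigned to a class.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the illustration is repeatable
weights, leftover = stick_breaking_weights(alpha=2.0, n_sticks=10, rng=rng)
print(weights, leftover)
```

Larger values of `alpha` spread the mass over more classes, which is one way to see why the number of classes represented in a sample can grow with the sample size.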
One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the use of state-of-the-art learning machines, in particular Support Vector Machines (SVMs). The SVM is a typical sample-based classifier whose performance comes down to how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, namely Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing -like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS), which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate.
Obsessive-compulsive disorder (OCD) is phenomenologically heterogeneous, and findings of underlying structure classification based on symptom grouping have been ambiguous to date. Variable-centered approaches, primarily factor analysis, have been used to identify homogeneous groups of symptoms, but person-centered latent methods have seen little use. This study was designed to uncover sets of homogeneous groupings within 1611 individuals with OCD, based on symptoms.
Latent class analysis (LCA) models were fit using 61 obsessive-compulsive symptoms (OCS) collected from the Yale-Brown Obsessive-Compulsive Scale. Relationships between latent class membership and treatment response, gender, symptom severity, and comorbid tic disorders were tested.
LCA models of best fit yielded three classes. Classes differed only in frequency of symptom endorsement. Classes with higher symptom endorsement were associated with earlier age of onset, being male, higher YBOCS symptom severity scores, and comorbid tic disorders. There were no differences in treatment response between classes.
These results provide support for the validity of a single underlying latent OCD construct, in addition to the distinct symptom factors identified previously via factor analyses.
obsessions; compulsions; latent class
Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary to have sound theoretical properties for the multicategory classifiers. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large-scale problems, especially for problems with a large number of classes. Furthermore, the SVM cannot produce class probability estimates directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with a large number of classes, asymptotic consistency, the ability to handle high-dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.
Fisher Consistency; Multicategory Classification; Regression; Squared loss; Support Vector Machine
We propose a novel learning method that combines multiple experimental modalities to improve the MHC Class-I binding prediction. Multiple experimental modalities are often accessible in the context of a binding problem. Such modalities can provide different labels of data, such as binary classifications, affinity measurements, or direct estimations of the binding profile. Current machine learning algorithms usually focus on a given label type. We here present a novel Multi-Label Vector Optimization (MLVO) formalism to produce classifiers based on the simultaneous optimization of multiple labels. Within this methodology, all label types are combined into a single constrained quadratic dual optimization problem.
We apply the MLVO to MHC class-I epitope prediction. We combine affinity measurements (IC50/EC50), binary classifications of epitopes as T cell activators, and existing algorithms. The multi-label vector optimization algorithms produce classifiers significantly better than the ones resulting from any of their components. These matrix-based classifiers are better than or equivalent to the existing state-of-the-art MHC-I epitope prediction tools in the studied alleles.
A range of psychiatric symptoms and cognitive deficits occur in Parkinson’s disease (PD), and symptom overlap and co-morbidity complicate the classification of non-motor symptoms. The objective of this study was to use analytic-based approaches to classify psychiatric and cognitive symptoms in PD.
Cross-sectional evaluation of a convenience sample of patients in specialty care.
Two outpatient movement disorders centers at the University of Pennsylvania and Philadelphia Veterans Affairs Medical Center.
177 patients with mild-moderate idiopathic PD and without significant global cognitive impairment.
Subjects were assessed with an extensive psychiatric, neuropsychological, and neurological battery. Latent class analysis (LCA) was used to statistically delineate group-level symptom profiles across measures of psychiatric and cognitive functioning. Predictors of class membership were also examined.
Results from the LCA indicated that a four-class solution best fit the data. 32.3% of the sample had good psychiatric and normal cognitive functioning, 17.5% had significant psychiatric co-morbidity but normal cognition, 26.0% had few psychiatric symptoms but had poorer cognitive functioning across a range of cognitive domains, and 24.3% had both significant psychiatric co-morbidity and poorer cognitive functioning. Age, disease severity, and medication use predicted class membership.
LCA delineates four classes of patients in mild-moderate PD, three of which experience significant non-motor impairments and comprise over two-thirds of patients. Neuropsychiatric symptoms and cognitive deficits follow distinct patterns in PD, and further study is needed to determine if these classes are generalizable, stable, predict function, quality of life and long-term outcomes, and are amenable to treatment at a class level.
Parkinson’s disease; neuropsychology; psychiatry; cognition; depression
Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained.
A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples.
A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Bioinformatics data analysis often deals with additive mixtures of signals for which only class labels are known. Then, the overall goal is to estimate class related signals for data mining purposes. A convenient application is metabolic monitoring of patients using infrared spectroscopy. Within an infrared spectrum each single compound contributes quantitatively to the measurement.
In this work, we propose a novel factorization technique for additive signal factorization that allows learning from classified samples. We define a composed loss function for this task and analytically derive a closed form equation such that training a model reduces to searching for an optimal threshold vector. Our experiments, carried out on synthetic and clinical data, show a sensitivity of up to 0.958 and specificity of up to 0.841 for a 15-class problem of disease classification. Using class and regression information in parallel, our algorithm outperforms linear SVM for training cases having many classes and few data.
The presented factorization method provides a simple and generative model and, therefore, represents a first step towards predictive factorization methods.
Study objective: To examine social inequalities in minor psychiatric morbidity as measured by the GHQ-12 using lagged models of psychiatric morbidity and changing job status.
Design: GHQ scores were modelled using two level hierarchical regression models with measurement occasions nested within individuals. The paper compares and contrasts three different ways of describing social position: income, social advantage and lifestyle (the Cambridge scale), and social class (the new National Statistics Socio-Economic Classification), and adjusts for attrition.
Setting: Survey interviews for a nationally representative sample of adults of working age living in Britain.
Participants: 8091 original adult respondents in 1991 who remain of working age during 1991–1998 from the British Household Panel Survey (BHPS).
Main results: There was a relation of GHQ-12 to social position when social position was combined with employment status. This relation itself varied according to a person's psychological health in the previous year.
Conclusions: The relation between social position and minor psychiatric morbidity depended on whether or not a person was employed, unemployed, or economically inactive. It was stronger in those with previously less good psychological health. Among employed men and women in good health, GHQ-12 varied little according to social class, status, or income. There was a "classic" social gradient in psychiatric morbidity, with worse health in less advantaged groups, among the economically inactive. Among the unemployed, a "reverse" gradient was found: the impact of unemployment on minor psychiatric morbidity was higher for those who were previously in a more advantaged social class position.
Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples.
We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form of classification equation, thus providing no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets comparing its performances with the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset.
The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than the standard Naïve Bayes model, in particular when the within sample heterogeneity is different in the different classes.
Mitochondria exist as a network of interconnected organelles undergoing constant fission and fusion. Current approaches to study mitochondrial morphology are limited by low data sampling coupled with manual identification and classification of complex morphological phenotypes. Here we propose an integrated mechanistic and data-driven modeling approach to analyze heterogeneous, quantified datasets and infer relations between mitochondrial morphology and apoptotic events. We initially performed high-content, multi-parametric measurements of mitochondrial morphological, apoptotic, and energetic states by high-resolution imaging of human breast carcinoma MCF-7 cells. Subsequently, decision tree-based analysis was used to automatically classify networked, fragmented, and swollen mitochondrial subpopulations, at the single-cell level and within cell populations. Our results revealed subtle but significant differences in morphology class distributions in response to various apoptotic stimuli. Furthermore, key mitochondrial functional parameters, including mitochondrial membrane potential and Bax activation, were measured under matched conditions. Data-driven fuzzy logic modeling was used to explore the non-linear relationships between mitochondrial morphology and apoptotic signaling, combining morphological and functional data in a single model. Modeling results are in accordance with previous studies, in which Bax regulates mitochondrial fragmentation and mitochondrial morphology influences mitochondrial membrane potential. In summary, we established and validated a platform for mitochondrial morphological and functional analysis that can be readily extended with additional datasets. We further discuss the benefits of a flexible systematic approach for elucidating specific and general relationships between mitochondrial morphology and apoptosis.
The advent of microarray technology has made it possible to classify disease states based on gene expression profiles of patients. Typically, marker genes are selected by measuring the power of their expression profiles to discriminate among patients of different disease states. However, expression-based classification can be challenging in complex diseases due to factors such as cellular heterogeneity within a tissue sample and genetic heterogeneity across patients. A promising technique for coping with these challenges is to incorporate pathway information into the disease classification procedure in order to classify disease based on the activity of entire signaling pathways or protein complexes rather than on the expression levels of individual genes or proteins. We propose a new classification method based on pathway activities inferred for each patient. For each pathway, an activity level is summarized from the gene expression levels of its condition-responsive genes (CORGs), defined as the subset of genes in the pathway whose combined expression delivers optimal discriminative power for the disease phenotype. We show that classifiers using pathway activity achieve better performance than classifiers based on individual gene expression, for both simple and complex case-control studies including differentiation of perturbed from non-perturbed cells and subtyping of several different kinds of cancer. Moreover, the new method outperforms several previous approaches that use a static (i.e., non-conditional) definition of pathways. Within a pathway, the identified CORGs may facilitate the development of better diagnostic markers and the discovery of core alterations in human disease.
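The selection of condition-responsive genes described above can be sketched as a greedy search. This is a simplified illustration under stated assumptions, not the authors' procedure: expression values are assumed to be z-scored per gene, the pathway activity of a gene set is taken as the sum of its z-scores divided by the square root of the set size, discriminative power is measured by the absolute Welch t-statistic between two phenotypes, and genes are added one at a time until the score stops improving. The names `t_score`, `activity`, and `greedy_corgs` are hypothetical.

```python
import math
from statistics import mean, variance


def t_score(values, labels):
    """Welch t-statistic between samples labelled 1 and samples labelled 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def activity(z, genes):
    """Per-sample activity of a gene set: sum of z-scores / sqrt(set size)."""
    n_samples = len(next(iter(z.values())))
    scale = math.sqrt(len(genes))
    return [sum(z[g][j] for g in genes) / scale for j in range(n_samples)]


def greedy_corgs(z, labels):
    """Greedily grow the gene set while the absolute t-score improves."""
    selected, best = [], 0.0
    remaining = set(z)
    while remaining:
        scored = [
            (abs(t_score(activity(z, selected + [g]), labels)), g)
            for g in sorted(remaining)
        ]
        score, gene = max(scored)
        if score <= best:
            break  # no candidate improves discriminative power
        selected.append(gene)
        best = score
        remaining.discard(gene)
    return selected, best


# Toy z-scored expression for one pathway (values made up for illustration).
labels = [1, 1, 1, 0, 0, 0]
z = {
    "g1": [1.0, 0.8, 1.2, -1.0, -0.9, -1.1],
    "g2": [0.9, 1.2, 0.8, -1.1, -1.0, -0.8],
    "g3": [0.5, -0.7, 0.1, 0.3, -0.4, 0.2],  # uninformative gene
}
selected, best = greedy_corgs(z, labels)
print(selected, best)
```

In this toy pathway the search keeps the two informative genes and leaves out the uninformative one, because adding it lowers the t-score of the combined activity.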
The advent of microarray technology has drawn immense interest to identify gene expression levels that can serve as biomarkers for disease. Marker genes are selected by examining each individual gene to see how well its expression level discriminates different disease types. In complex diseases such as cancer, good marker genes can be hard to find due to cellular heterogeneity within the tissue and genetic heterogeneity across patients. A promising technique for addressing these challenges is to incorporate biological pathway information into the marker identification procedure, permitting disease classification based on the activity of entire pathways rather than simply on the expression levels of individual genes. However, previous pathway-based methods have not significantly outperformed gene-based methods. Here, we propose a new pathway-based classification procedure in which markers are encoded not as individual genes, nor as the set of genes making up a known pathway, but as subsets of “condition-responsive genes (CORGs)” within those pathways. Using expression profiles from seven different microarray studies, we show that the accuracy of this method is significantly better than both the conventional gene- and pathway- based diagnostics. Furthermore, the identified CORGs may facilitate the development of effective diagnostic markers and the discovery of molecular mechanisms underlying disease.
The launch of the 5th version of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) has sparked a debate about the current approach to psychiatric classification. The most basic and enduring problem of the DSM is that its classifications are heterogeneous clinical descriptions rather than valid diagnoses, which hampers scientific progress. Therefore, more homogeneous evidence-based diagnostic entities should be developed. To this end, data-driven techniques, such as latent class and factor analyses, have already been widely applied. However, these techniques are insufficient to account for all relevant levels of heterogeneity among real-life individuals. There is heterogeneity across persons (p: for example, subgroups), across symptoms (s: for example, symptom dimensions) and over time (t: for example, course trajectories), and these cannot be regarded separately. Psychiatry should upgrade to techniques that can analyze multi-mode (p-by-s-by-t) data and can incorporate all of these levels at the same time to identify optimal homogeneous subgroups (for example, groups with similar profiles/connectivity of symptomatology and similar course). For these purposes, Multimode Principal Component Analysis and (Mixture)-Graphical Modeling may be promising techniques.
DSM-5; Heterogeneity; Data-driven techniques; Cattell’s cube
The purpose of the study was to explore heterogeneity and differential treatment outcome among a sample of patients with binge eating disorder (BED).
A latent class analysis was conducted with 205 treatment-seeking, overweight or obese individuals with BED randomized to Interpersonal Psychotherapy (IPT), Behavioral Weight Loss (BWL), or guided self-help based on Cognitive Behavioral Therapy (CBTgsh). A latent transition analysis tested the predictive validity of the latent class analysis model.
A 4-class model yielded the best overall fit to the data. Class 1 was characterized by a lower mean body mass index (BMI) and increased physical activity. Individuals in class 2 reported the most binge eating, shape and weight concerns, compensatory behaviors, and negative affect. Class 3 patients reported similar binge eating frequencies to class 2 with lower levels of exercise or compensation. Class 4 was characterized by the highest average BMI, the most overeating episodes, fewer binge episodes, and an absence of compensatory behaviors. Classes 1 and 3 had the highest and lowest percentage of individuals with a past eating disorder diagnosis, respectively. The latent transition analysis found a higher probability of remission from binge eating among those receiving IPT in Class 2 and CBTgsh in Class 3.
The latent class analysis identified four distinct classes using baseline measures of eating disorder and depressive symptoms, body weight, and physical activity. Implications of the observed differential treatment response are discussed.
binge eating disorder; latent class analysis; latent transition analysis; treatment specificity
Nonignorable missing data are a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or an intermittent missing data pattern. In our study, we propose a new two-latent-class model for categorical data with informative dropouts, dividing the observed data into two latent classes: one class in which the outcomes are deterministic and a second one in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under the assumption of conditional independence. Parameters are estimated by maximum likelihood based on the above assumptions and the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the areas under the ROC curves, both in simulations and in an application to a smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missing-data mechanisms. The application results show that our proposed method performs better than the shared parameter model and the weighted GEE model.
Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation
Due to the daily mass production and the widespread variation of medical X-ray images, it is necessary to classify these for searching and retrieval purposes, especially for content-based medical image retrieval systems. In this paper, a medical X-ray image hierarchical classification structure based on a novel merging and splitting scheme and using shape and texture features is proposed. In the first level of the proposed structure, to improve the classification performance, classes with similar shape content are grouped, based on merging measures and shape features, into general overlapped classes. In the next levels of this structure, the overlapped classes are split into smaller classes based on the classification performance of a combination of shape and texture features or of texture features only. In the last levels, this procedure continues until all the classes are formed separately. Moreover, to optimize the feature vector in the proposed structure, we use an orthogonal forward selection algorithm based on the Mahalanobis class separability measure as a feature selection and reduction algorithm. In other words, according to the complexity and inter-class distance of each class, a sub-space of the feature space is selected in each level and then a supervised merging and splitting scheme is applied to form the hierarchical classification. The proposed structure is evaluated on a database consisting of 2158 medical X-ray images of 18 classes (IMAGECLEF 2005 database), and an accuracy rate of 93.6% is obtained in the last level of the hierarchical structure for an 18-class classification problem.
Hierarchical classification; merging and splitting scheme; orthogonal forward selection; shape and texture features
Inattention/hyperactivity and aggressive behavior problems were measured in 335 children from school entry throughout adolescence, at 3-year intervals. Children were participants in a high-risk prospective study of substance use disorders and comorbid problems. A parallel process latent growth model found aggressive behavior decreasing throughout childhood and adolescence, whereas inattentive/hyperactive behavior levels were constant. Growth mixture modeling, in which developmental trajectories are statistically classified, found two classes for inattention/hyperactivity and two for aggressive behavior, resulting in a total of four trajectory classes. Different influences of the family environment predicted development of the two types of behavior problems when the other behavior problem was held constant. Lower emotional support and lower intellectual stimulation by the parents in early childhood predicted membership in the high problem class of inattention/hyperactivity when the trajectory of aggression was held constant. Conversely, conflict and lack of cohesiveness in the family environment predicted membership in a worse developmental trajectory of aggressive behavior when the inattention/hyperactivity trajectories were held constant. The implications of these findings for the development of inattention/hyperactivity and for the development of risk for the emergence of substance use disorders are discussed.
This study aimed to investigate empirically how individuals with symptoms of functional somatic syndromes should be classified. We also aimed to examine genetic and environmental influences on the classification.
A total of 28531 twins aged 41–64 underwent screening interviews via a computer-assisted data collection system from 1998 to 2002. Nine functional somatic symptoms (abnormal tiredness, general muscular pain, recurrent abdominal discomfort, back pain, gastroesophageal reflux, recurrent headache, recurrent urinary problem, dizziness, breathlessness at rest) were assessed using structured questions in a blinded manner. Latent class analysis was applied to the data. Structural equation modeling was further performed in order to estimate the relative importance of genetic and environmental influences on class probability.
Latent class analysis resulted in a 5-class solution. Individuals in the first class did not show any health problems. Those assigned to the second, third, and fourth classes tended to have abnormal tiredness, gastrointestinal problems, and pain-related symptoms, respectively. Individuals in the fifth class had multiple symptoms to a greater extent than the other classes. All five classes showed modest genetic influences (7–29% of the total variation), with gender differences except in Class 3; however, the majority of influences on class membership derived from unique environmental effects.
The findings suggested the necessity of re-defining the existing classification criteria for functional somatic syndromes in terms of single (uncomplicated) or multiple (complicated) syndromes. Environmental influences are important for the aetiology of functional somatic syndromes.
functional somatic syndromes; chronic fatigue syndrome; chronic widespread pain; irritable bowel syndrome; comorbidity; latent class analysis
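The latent class analysis step described above can be illustrated with a minimal sketch. It assumes binary (present/absent) symptom indicators that are conditionally independent given the class, and it fits the model with a plain EM loop; the abstract does not state which estimation procedure or software was used, so the function name `fit_latent_classes`, its parameters, and the toy data in the usage note are hypothetical.

```python
import random


def fit_latent_classes(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to binary items with a basic EM loop.

    data: list of rows, each a list of 0/1 item responses.
    Returns (class weights, per-class item probabilities, responsibilities).
    Illustrative sketch only; not the procedure used in the study.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting values so that the classes are not identical.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each respondent.
        resp = []
        for row in data:
            joint = []
            for c in range(n_classes):
                like = weights[c]
                for j, x in enumerate(row):
                    like *= probs[c][j] if x else 1.0 - probs[c][j]
                joint.append(like)
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: update class sizes and item endorsement probabilities.
        for c in range(n_classes):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / len(data)
            for j in range(n_items):
                endorsed = sum(r[c] for r, row in zip(resp, data) if row[j])
                # Clamp away from 0 and 1 to keep likelihoods positive.
                probs[c][j] = min(max(endorsed / mass, 1e-6), 1.0 - 1e-6)
    return weights, probs, resp
```

For example, `fit_latent_classes([[1, 1, 1, 1]] * 30 + [[0, 0, 0, 0]] * 70, 2)` separates the two response patterns into classes with weights close to 0.3 and 0.7. In practice the number of classes would be chosen by comparing fits across several values, and EM would be restarted from multiple seeds to avoid poor local optima.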
Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification.
We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types.
The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples. The results demonstrate that the methods are capable of reconstructing both the sample and cell type specific expression values from heterogeneous mixtures and that the mixing percentages of different cell types can also be estimated. Furthermore, a general purpose model selection method can be used to select the correct number of cell types.
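The case in which the mixing percentages are known can be sketched as follows. The sketch assumes a linear mixing model, in which each measured expression value is the mixing-weighted sum of the pure cell-type values, and solves it by ordinary least squares through the normal equations. The abstract does not specify the estimator, so this is an illustration under that assumption; the function names `solve_linear_system` and `estimate_pure_profiles` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b for a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mixing matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_pure_profiles(mixing, observed):
    """Least-squares estimate of pure cell-type expression values.

    mixing:   samples x cell_types matrix of known mixing fractions.
    observed: samples x genes matrix of measured expression values.
    Returns a cell_types x genes matrix (assumes observed = mixing * pure).
    """
    n_samples, n_types, n_genes = len(mixing), len(mixing[0]), len(observed[0])
    # Normal equations: (M^T M) s_g = M^T y_g for every gene g.
    mtm = [[sum(mixing[i][a] * mixing[i][b] for i in range(n_samples))
            for b in range(n_types)] for a in range(n_types)]
    pure = [[0.0] * n_genes for _ in range(n_types)]
    for g in range(n_genes):
        mty = [sum(mixing[i][a] * observed[i][g] for i in range(n_samples))
               for a in range(n_types)]
        solution = solve_linear_system(mtm, mty)
        for k in range(n_types):
            pure[k][g] = solution[k]
    return pure
```

With noise-free data, for instance three samples mixed at 80/20, 50/50, and 10/90 from two cell types, the pure profiles are recovered exactly. The sketch does not enforce non-negative expression values and does not cover the joint estimation of mixing percentages, which the abstract treats as a separate optimization problem.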
Case selection is a useful approach for increasing the efficiency and performance of case-based classifiers. Multiple techniques have been designed to perform case selection. This paper empirically investigates how class imbalance in the available set of training cases can impact the performance of the resulting classifier as well as properties of the selected set. In this study, the experiments are performed using a dataset for the problem of detecting breast masses in screening mammograms. The classification problem was binary and we used a k-nearest neighbor classifier. The classifier’s performance was evaluated using the Receiver Operating Characteristic (ROC) area under the curve (AUC) measure. The experimental results indicate that although class imbalance reduces the performance of the derived classifier and the effectiveness of selection at improving overall classifier performance, case selection can still be beneficial, regardless of the level of class imbalance.
Case-based reasoning; Computer-aided decision; Class imbalance; Case selection
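The evaluation setup described above, a k-nearest neighbor classifier on a binary problem scored by ROC AUC, can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and using the fraction of positive neighbors as the decision score; the abstract does not state the distance measure or the value of k, and the function names `knn_scores` and `roc_auc` are hypothetical.

```python
import math


def knn_scores(train_x, train_y, test_x, k=3):
    """Score each test case by the fraction of positives among its k nearest
    training cases (Euclidean distance; labels are 0 or 1)."""
    scores = []
    for query in test_x:
        nearest = sorted(range(len(train_x)),
                         key=lambda i: math.dist(train_x[i], query))[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

To probe the effect of class imbalance, one could remove positive cases from `train_x` at increasing rates and track how `roc_auc` on a fixed test set changes, with and without a case selection step applied to the training cases.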
An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. One aspect of this clustering problem is estimating the number of clusters in a dataset. A new prediction-based resampling method, Clest, was developed to address this problem.
Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems.
We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study.
Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.
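The general idea of combining prediction accuracy with resampling can be sketched as below. This is a simplified illustration and not Clest itself: it assumes k-means clustering, a nearest-centroid predictor, and the Rand index as the agreement measure, and it omits the comparison against a reference null distribution. All function names are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import median


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def prediction_agreement(points, k, n_splits=20, seed=0):
    """Median agreement between predicted and clustered test-set labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        learn, test = shuffled[:half], shuffled[half:]
        learn_centroids, _ = kmeans(learn, k)               # cluster learning set
        predicted = [nearest(p, learn_centroids) for p in test]  # predict test labels
        _, clustered = kmeans(test, k)                      # cluster test set directly
        scores.append(rand_index(predicted, clustered))
    return median(scores)
```

A number of clusters that reflects real structure should give high agreement across splits, since clusters learned on one half of the data predict the grouping found independently in the other half. Turning such scores into an estimate of the number of clusters requires a reference for what agreement to expect when there is no structure, which this sketch leaves out.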