A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of less than 100 points. Some classification rule is used to design the classifier from this data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, virtually, in all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims.
Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
Scientific knowledge is grounded in a particular epistemology and, owing to the requirements of that epistemology, possesses limitations. Some limitations are intrinsic, in the sense that they depend inherently on the nature of scientific knowledge; others are contingent, depending on the present state of knowledge, including technology. Understanding limitations facilitates scientific research because one can then recognize when one is confronted by a limitation, as opposed to simply being unable to solve a problem within the existing bounds of possibility. In the hope that the role of limiting factors can be brought more clearly into focus and discussed, we consider several sources of limitation as they apply to biological knowledge: mathematical complexity, experimental constraints, validation, knowledge discovery, and human intellectual capacity.
Complexity; Gene regulatory networks; Epistemology; Experimental design; Genomics; Knowledge discovery; Modeling; Validation.
Time-course expression profiles and methods for spectrum analysis have been applied for detecting transcriptional periodicities, which are valuable patterns to unravel genes associated with cell cycle and circadian rhythm regulation. However, most of the proposed methods suffer from restrictions and large false positives to a certain extent. Additionally, in some experiments, arbitrarily irregular sampling times as well as the presence of high noise and small sample sizes make accurate detection a challenging task. A novel scheme for detecting periodicities in time-course expression data is proposed, in which a real-valued iterative adaptive approach (RIAA), originally proposed for signal processing, is applied for periodogram estimation. The inferred spectrum is then analyzed using Fisher's hypothesis test. With a proper p-value threshold, periodic genes can be detected. A periodic signal, two nonperiodic signals, and four sampling strategies were considered in the simulations, including both bursts and drops. In addition, two yeast real datasets were applied for validation. The simulations and real data analysis reveal that RIAA can perform competitively with the existing algorithms. The advantage of RIAA is manifested when the expression data are highly irregularly sampled, and when the number of cycles covered by the sampling time points is very reduced.
Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective to detect overlapping peptides. In addition, existing algorithms usually work at different charge states separately, isolating useful information that can be drawn from other charge states, which may lead to poor detection of low abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of possible peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together—especially by deriving information across charge states—low abundance peptides can be better identified and peptide detection rates can be improved. Instead of local template matching, BPDA2d performs global optimization for all candidates and systematically optimizes their signals. Since BPDA2d looks for the optimal among all possible interpretations of the given spectra, it has the capability in handling complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be directly used for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at http://gsp.tamu.edu/Publications/supplementary/sun11a/
Supplementary data are available at Bioinformatics online.
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering
Boolean networks and, more generally, probabilistic Boolean networks, as one class of gene regulatory networks, model biological processes with the network dynamics determined by the logic-rule regulatory functions in conjunction with probabilistic parameters involved in network transitions. While there has been significant research on applying different control policies to alter network dynamics as future gene therapeutic intervention, we have seen less work on understanding the sensitivity of network dynamics with respect to perturbations to networks, including regulatory rules and the involved parameters, which is particularly critical for the design of intervention strategies. This paper studies this less investigated issue of network sensitivity in the long run. As the underlying model of probabilistic Boolean networks is a finite Markov chain, we define the network sensitivity based on the steady-state distributions of probabilistic Boolean networks and call it long-run sensitivity. The steady-state distribution reflects the long-run behavior of the network and it can give insight into the dynamics or momentum existing in a system. The change of steady-state distribution caused by possible perturbations is the key measure for intervention. This newly defined long-run sensitivity can provide insight on both network inference and intervention. We show the results for probabilistic Boolean networks generated from random Boolean networks and the results from two real biological networks illustrate preliminary applications of sensitivity in intervention for practical problems.
Genetic regulatory networks; Boolean networks; probabilistic Boolean networks; Markov chains; sensitivity; steady-state distribution; intervention; metastasis
For science, theoretical or applied, to significantly advance, researchers must use the most appropriate mathematical methods. A century and a half elapsed between Newton’s development of the calculus and Laplace’s development of celestial mechanics. One cannot imagine the latter without the former. Today, more than three-quarters of a century has elapsed since the birth of stochastic systems theory. This article provides a perspective on the utilization of systems theory as the proper vehicle for the development of systems biology and its application to complex regulatory diseases such as cancer.
cancer; control; epistemology; systems biology
Despite initial response in adjuvant chemotherapy, ovarian cancer patients treated with the combination of paclitaxel and carboplatin frequently suffer from recurrence after few cycles of treatment, and the underlying mechanisms causing the chemoresistance remain unclear. Recently, The Cancer Genome Atlas (TCGA) research network concluded an ovarian cancer study and released the dataset to the public. The TCGA dataset possesses large sample size, comprehensive molecular profiles, and clinical outcome information; however, because of the unknown molecular subtypes in ovarian cancer and the great diversity of adjuvant treatments TCGA patients went through, studying chemotherapeutic response using the TCGA data is difficult. Additionally, factors such as sample batches, patient ages, and tumor stages further confound or suppress the identification of relevant genes, and thus the biological functions and disease mechanisms.
To address these issues, herein we propose an analysis procedure designed to reduce suppression effect by focusing on a specific chemotherapeutic treatment, and to remove confounding effects such as batch effect, patient's age, and tumor stages. The proposed procedure starts with a batch effect adjustment, followed by a rigorous sample selection process. Then, the gene expression, copy number, and methylation profiles from the TCGA ovarian cancer dataset are analyzed using a semi-supervised clustering method combined with a novel scoring function. As a result, two molecular classifications, one with poor copy number profiles and one with poor methylation profiles, enriched with unfavorable scores are identified. Compared with the samples enriched with favorable scores, these two classifications exhibit poor progression-free survival (PFS) and might be associated with poor chemotherapy response specifically to the combination of paclitaxel and carboplatin. Significant genes and biological processes are detected subsequently using classical statistical approaches and enrichment analysis.
The proposed procedure for the reduction of confounding and suppression effects and the semi-supervised clustering method are essential steps to identify genes associated with the chemotherapeutic response.
Mass spectrometry is a complex technique used for large-scale protein profiling with clinical and pharmaceutical applications. While individual components in the system have been studied extensively, little work has been done to integrate various modules and evaluate them from a systems point of view.
In this work, we investigate this problem by putting together the different modules in a typical proteomics work flow, in order to capture and analyze key factors that impact the number of identified peptides and quantified proteins, protein quantification error, differential expression results, and classification performance. The proposed proteomics pipeline model can be used to optimize the work flow as well as to pinpoint critical bottlenecks worth investing time and resources into for improving performance. Using the model-based approach proposed here, one can study systematically the critical problem of proteomic biomarker discovery, by means of simulation using ground-truthed synthetic MS data.
Molecularly targeted agents (MTAs) are increasingly used for cancer treatment, the goal being to improve the efficacy and selectivity of cancer treatment by developing agents that block the growth of cancer cells by interfering with specific targeted molecules needed for carcinogenesis and tumor growth. This approach differs from traditional cytotoxic anticancer drugs. The lack of specificity of cytotoxic drugs allows a relatively straightforward approach in preclinical and clinical studies, where the optimal dose has usually been defined as the "maximum tolerated dose" (MTD). This toxicity-based dosing approach is founded on the assumption that the therapeutic anticancer effect and toxic effects of the drug increase in parallel as the dose is escalated. On the contrary, most MTAs are expected to be more selective and less toxic than cytotoxic drugs. Consequently, the maximum therapeutic effect may be achieved at a "biologically effective dose" (BED) well below the MTD. Hence, dosing study for MTAs should be different from cytotoxic drugs. Enhanced efforts to molecularly characterize the drug efficacy for MTAs in preclinical models will be valuable for successfully designing dosing regimens for clinical trials.
A novel preclinical model combining experimental methods and theoretical analysis is proposed to investigate the mechanism of action and identify pharmacodynamic characteristics of the drug. Instead of fixed time point analysis of the drug exposure to drug effect, the time course of drug effect for different doses is quantitatively studied on cell line-based platforms using system identification, where tumor cells' responses to drugs through the use of fluorescent reporters are sampled over a time course. Results show that drug effect is time-varying and higher dosages induce faster and stronger responses as expected. However, the drug efficacy change along different dosages is not linear; on the contrary, there exist certain thresholds. This kind of preclinical study can provide valuable suggestions about dosing regimens for the in vivo experimental stage to increase productivity.
Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule.
Results: This article provides a careful probabilistic analysis of the second issue and the ‘multiple-rule bias’, resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators.
Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Motivated by the frustration of translation of research advances in the molecular and cellular biology of cancer into treatment, this study calls for cross-disciplinary efforts and proposes a methodology of incorporating drug pharmacology information into drug therapeutic response modeling using a computational systems biology approach. The objectives are two fold. The first one is to involve effective mathematical modeling in the drug development stage to incorporate preclinical and clinical data in order to decrease costs of drug development and increase pipeline productivity, since it is extremely expensive and difficult to get the optimal compromise of dosage and schedule through empirical testing. The second objective is to provide valuable suggestions to adjust individual drug dosing regimens to improve therapeutic effects considering most anticancer agents have wide inter-individual pharmacokinetic variability and a narrow therapeutic index. A dynamic hybrid systems model is proposed to study drug antitumor effect from the perspective of tumor growth dynamics, specifically the dosing and schedule of the periodic drug intake, and a drug’s pharmacokinetics and pharmacodynamics information are linked together in the proposed model using a state-space approach. It is proved analytically that there exists an optimal drug dosage and interval administration point, and demonstrated through simulation study.
drug effect; drug efficacy region; dosing regimens; hybrid systems; systems biology; tumor growth
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is common place in bio-informatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
Classification; epistemology; error estimation; genomics; validation.
Motivation: A key goal of studying biological systems is to design therapeutic intervention strategies. Probabilistic Boolean networks (PBNs) constitute a mathematical model which enables modeling, predicting and intervening in their long-run behavior using Markov chain theory. The long-run dynamics of a PBN, as represented by its steady-state distribution (SSD), can guide the design of effective intervention strategies for the modeled systems. A major obstacle for its application is the large state space of the underlying Markov chain, which poses a serious computational challenge. Hence, it is critical to reduce the model complexity of PBNs for practical applications.
Results: We propose a strategy to reduce the state space of the underlying Markov chain of a PBN based on a criterion that the reduction least distorts the proportional change of stationary masses for critical states, for instance, the network attractors. In comparison to previous reduction methods, we reduce the state space directly, without deleting genes. We then derive stationary control policies on the reduced network that can be naturally induced back to the original network. Computational experiments study the effects of the reduction on model complexity and the performance of designed control policies which is measured by the shift of stationary mass away from undesirable states, those associated with undesirable phenotypes. We consider randomly generated networks as well as a 17-gene gastrointestinal cancer network, which, if not reduced, has a 217 × 217 transition probability matrix. Such a dimension is too large for direct application of many previously proposed PBN intervention strategies.
Supplementary information: Supplementary information are available at Bioinformatics online.
One of the most important goals of the mathematical modeling of gene regulatory networks is to alter their behavior toward desirable phenotypes. Therapeutic techniques are derived for intervention in terms of stationary control policies. In large networks, it becomes computationally burdensome to derive an optimal control policy. To overcome this problem, greedy intervention approaches based on the concept of the Mean First Passage Time or the steady-state probability mass of the network states were previously proposed. Another possible approach is to use reduction mappings to compress the network and develop control policies on its reduced version. However, such mappings lead to loss of information and require an induction step when designing the control policy for the original network.
In this paper, we propose a novel solution, CoD-CP, for designing intervention policies for large Boolean networks. The new method utilizes the Coefficient of Determination (CoD) and the Steady-State Distribution (SSD) of the model. The main advantage of CoD-CP in comparison with the previously proposed methods is that it does not require any compression of the original model, and thus can be directly designed on large networks. The simulation studies on small synthetic networks shows that CoD-CP performs comparable to previously proposed greedy policies that were induced from the compressed versions of the networks. Furthermore, on a large 17-gene gastrointestinal cancer network, CoD-CP outperforms other two available greedy techniques, which is precisely the kind of case for which CoD-CP has been developed. Finally, our experiments show that CoD-CP is robust with respect to the attractor structure of the model.
The newly proposed CoD-CP provides an attractive alternative for intervening large networks where other available greedy methods require size reduction on the network and an extra induction step before designing a control policy.
Accumulation of gene mutations in cells is known to be responsible for tumor progression, driving it from benign states to malignant states. However, previous studies have shown that the detailed sequence of gene mutations, or the steps in tumor progression, may vary from tumor to tumor, making it difficult to infer the exact path that a given type of tumor may have taken.
In this paper, we propose an effective probabilistic algorithm for reconstructing the tumor progression process based on partial knowledge of the underlying gene regulatory network and the steady state distribution of the gene expression values in a given tumor. We take the BNp (Boolean networks with pertubation) framework to model the gene regulatory networks. We assume that the true network is not exactly known but we are given an uncertainty class of networks that contains the true network. This network uncertainty class arises from our partial knowledge of the true network, typically represented as a set of local pathways that are embedded in the global network. Given the SSD of the cancerous network, we aim to simultaneously identify the true normal (healthy) network and the set of gene mutations that drove the network into the cancerous state. This is achieved by analyzing the effect of gene mutation on the SSD of a gene regulatory network. At each step, the proposed algorithm reduces the uncertainty class by keeping only those networks whose SSDs get close enough to the cancerous SSD as a result of additional gene mutation. These steps are repeated until we can find the best candidate for the true network and the most probable path of tumor progression.
Simulation results based on both synthetic networks and networks constructed from actual pathway knowledge show that the proposed algorithm can identify the normal network and the actual path of tumor progression with high probability. The algorithm is also robust to model mismatch and allows us to control the trade-off between efficiency and accuracy.
Because the basic unit of biology is the cell, biological knowledge is rooted in the epistemology of the cell, and because life is the salient characteristic of the cell, its epistemology must be centered on its livingness, not its constituent components. The organization and regulation of these components in the pursuit of life constitute the fundamental nature of the cell. Thus, regulation sits at the heart of biological knowledge of the cell and the extraordinary complexity of this regulation conditions the kind of knowledge that can be obtained, in particular, the representation and intelligibility of that knowledge. This paper is essentially split into two parts. The first part discusses the inadequacy of everyday intelligibility and intuition in science and the consequent need for scientific theories to be expressed mathematically without appeal to commonsense categories of understanding, such as causality. Having set the backdrop, the second part addresses biological knowledge. It briefly reviews modern scientific epistemology from a general perspective and then turns to the epistemology of the cell. In analogy with a multi-faceted factory, the cell utilizes a highly parallel distributed control system to maintain its organization and regulate its dynamical operation in the face of both internal and external changes. Hence, scientific knowledge is constituted by the mathematics of stochastic dynamical systems, which model the overall relational structure of the cell and how these structures evolve over time, stochasticity being a consequence of the need to ignore a large number of factors while modeling relatively few in an extremely complex environment.
Biology; causality; computational biology; epistemology; genomics; systems biology.
Finding reliable gene markers for accurate disease classification is very challenging due to a number of reasons, including the small sample size of typical clinical data, high noise in gene expression measurements, and the heterogeneity across patients. In fact, gene markers identified in independent studies often do not coincide with each other, suggesting that many of the predicted markers may have no biological significance and may be simply artifacts of the analyzed dataset. To find more reliable and reproducible diagnostic markers, several studies proposed to analyze the gene expression data at the level of groups of functionally related genes, such as pathways. Studies have shown that pathway markers tend to be more robust and yield more accurate classification results. One practical problem of the pathway-based approach is the limited coverage of genes by currently known pathways. As a result, potentially important genes that play critical roles in cancer development may be excluded. To overcome this problem, we propose a novel method for identifying reliable subnetwork markers in a human protein-protein interaction (PPI) network.
In this method, we overlay the gene expression data with the PPI network and look for the most discriminative linear paths that consist of discriminative genes that are highly correlated to each other. The overlapping linear paths are then optimally combined into subnetworks that can potentially serve as effective diagnostic markers. We tested our method on two independent large-scale breast cancer datasets and compared the effectiveness and reproducibility of the identified subnetwork markers with gene-based and pathway-based markers. We also compared the proposed method with an existing subnetwork-based method.
The proposed method can efficiently find reliable subnetwork markers that outperform the gene-based and pathway-based markers in terms of discriminative power, reproducibility and classification performance. Subnetwork markers found by our method are highly enriched in common GO terms, and they can more accurately classify breast cancer metastasis compared to markers found by a previous method.
Mass spectrometry (MS) is an essential analytical tool in proteomics. Many existing algorithms for peptide detection are based on isotope template matching and usually work at different charge states separately, making them ineffective to detect overlapping peptides and low abundance peptides.
We present BPDA, a Bayesian approach for peptide detection in data produced by MS instruments with high enough resolution to baseline-resolve isotopic peaks, such as MALDI-TOF and LC-MS. We model the spectra as a mixture of candidate peptide signals, and the model is parameterized by MS physical properties. BPDA is based on a rigorous statistical framework and avoids problems, such as voting and ad-hoc thresholding, generally encountered in algorithms based on template matching. It systematically evaluates all possible combinations of possible peptide candidates to interpret a given spectrum, and iteratively finds the best fitting peptide signal in order to minimize the mean squared error of the inferred spectrum to the observed spectrum. In contrast to previous detection methods, BPDA performs deisotoping and deconvolution of mass spectra simultaneously, which enables better identification of weak peptide signals and produces higher sensitivities and more robust results. Unlike template-matching algorithms, BPDA can handle complex data where features overlap. Our experimental results indicate that BPDA performs well on simulated data and real MS data sets, for various resolutions and signal to noise ratios, and compares very favorably with commonly used commercial and open-source software, such as flexAnalysis, OpenMS, and Decon2LS, according to sensitivity and detection accuracy.
Unlike previous detection methods, which only employ isotopic distributions and work at each single charge state alone, BPDA takes into account the charge state distribution as well, thus lending information to better identify weak peptide signals and produce more robust results. The proposed approach is based on a rigorous statistical framework, which avoids problems generally encountered in algorithms based on template matching. Our experiments indicate that BPDA performs well on both simulated data and real data, and compares very favorably with commonly used commercial and open-source software. The BPDA software can be downloaded from http://gsp.tamu.edu/Publications/supplementary/sun10a/bpda.
We have developed novel molecular methods using a stool sample, which contains intact sloughed colon cells, to quantify colonic gene expression profiles. In this study, our goal was to identify diagnostic gene sets (combinations) for the noninvasive classification of different phenotypes. For this purpose, the effects of a legume-enriched, low glycemic index, high fermentable fiber diet was evaluated in subjects with four possible combinations of risk factors, including insulin resistance and a history of adenomatous polyps. In a randomized crossover design controlled feeding study, each participant (a total of 23; 5–12 per group) consumed the experimental diet (1.5 cups of cooked dry beans) and a control diet (isocaloric average American diet) for 4 weeks with a 3-week washout period between diets. Using prior biological knowledge, the complexity of feature selection was reduced to perform an exhaustive search on all allowable feature (gene) sets of size 3, and among these, 27 had (unbiased) error estimates of 0.15 or less. Linear discriminant analysis was successfully used to identify the best single genes and two- to three-gene combinations for distinguishing subjects with insulin resistance, a history of polyps, or exposure to a chemoprotective legume-rich diet. These results support our premise that gene products (RNA) isolated from stool have diagnostic value in terms of assessing colon cancer risk.
Markovian regulatory networks constitute a class of discrete state-space models used to study gene regulatory dynamics and discover methods that beneficially alter those dynamics. Thereby, this class of models provides a framework to discover effective drug targets and design potent therapeutic strategies. The salient translational goal is to design therapeutic strategies that desirably modify network dynamics via external signals that vary the expressions of a control gene. The objective of an intervention strategy is to reduce the likelihood of the pathological cellular function related to a disease. The task of finding an effective intervention strategy can be formulated as a sequential decision making problem for a pre-defined cost of intervention and a cost-per-stage function that discriminates the gene-activity profiles. An effective intervention strategy prescribes the actions associated with an external signal that result in the minimum expected cost. This strategy in turn can be used as a treatment that reduces the long-run likelihood of gene expressions favorable to the disease. In this tutorial, we briefly summarize the first method proposed to design such therapeutic interventions, and then move on to some of the recent refinements that have been proposed. Each of these recent intervention methods is motivated by practical or analytical considerations. The presentation of the key ideas is facilitated with the help of two case studies.
Regulatory networks; markovian decision processes; translational genomics; systems biology.
When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is limit the number of features being considered, restrict features sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small—that is, the prior biological knowledge is not too poor—then one should expect, with high probability, to find good feature sets.
Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
classification; feature ranking; ranking power
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.
With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods.