1.  BPDA2d—a 2D global optimization-based Bayesian peptide detection algorithm for liquid chromatograph–mass spectrometry 
Bioinformatics  2011;28(4):564-572.
Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective at detecting overlapping peptides. In addition, existing algorithms usually work at different charge states separately, isolating useful information that can be drawn from other charge states, which may lead to poor detection of low abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together, especially by deriving information across charge states, low abundance peptides can be better identified and peptide detection rates can be improved. Instead of local template matching, BPDA2d performs global optimization for all candidates and systematically optimizes their signals. Since BPDA2d looks for the optimum among all possible interpretations of the given spectra, it can handle complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be directly used for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at http://gsp.tamu.edu/Publications/supplementary/sun11a/
Contact: Michelle.Zhang@utsa.edu; edward@ece.tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr675
PMCID: PMC3278754  PMID: 22155863
2.  High-dimensional bolstered error estimation 
Bioinformatics  2011;27(21):3056-3064.
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering
Contact: edward@mail.ece.tamu.edu
doi:10.1093/bioinformatics/btr518
PMCID: PMC3198579  PMID: 21914630
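The bolstering idea in entry 2 can be illustrated with a minimal sketch. It assumes a one-dimensional feature, a fixed threshold classifier, and a Gaussian bolstering kernel whose standard deviation `sigma` is picked by hand. The function names and the toy data are hypothetical, and the paper's computation of an optimal kernel variance is not reproduced here.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def resubstitution_error(points, labels, threshold):
    """Fraction of training points misclassified by: predict 1 if x > threshold."""
    wrong = sum(1 for x, y in zip(points, labels) if (x > threshold) != (y == 1))
    return wrong / len(points)

def bolstered_resubstitution_error(points, labels, threshold, sigma):
    """Average Gaussian kernel mass that falls on the wrong side of the threshold."""
    total = 0.0
    for x, y in zip(points, labels):
        mass_below = normal_cdf((threshold - x) / sigma)
        if y == 1:
            # Class 1 is predicted above the threshold, so mass below it counts as error.
            total += mass_below
        else:
            # Class 0 is predicted at or below the threshold, so mass above it counts as error.
            total += 1.0 - mass_below
    return total / len(points)

# Toy data (hypothetical): perfectly separated by the threshold 0.0.
points = [-1.2, -0.4, -0.1, 0.2, 0.9, 1.5]
labels = [0, 0, 0, 1, 1, 1]
plain = resubstitution_error(points, labels, threshold=0.0)
bolstered = bolstered_resubstitution_error(points, labels, threshold=0.0, sigma=0.5)
```

On this toy sample the plain resubstitution estimate is zero, while the bolstered estimate is positive because points near the threshold spread part of their kernel mass across it. A larger `sigma` raises the estimate, which is why the variance setting matters.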
3.  Newton, Laplace, and The Epistemology of Systems Biology 
Cancer Informatics  2012;11:185-190.
For science, theoretical or applied, to significantly advance, researchers must use the most appropriate mathematical methods. A century and a half elapsed between Newton’s development of the calculus and Laplace’s development of celestial mechanics. One cannot imagine the latter without the former. Today, more than three-quarters of a century has elapsed since the birth of stochastic systems theory. This article provides a perspective on the utilization of systems theory as the proper vehicle for the development of systems biology and its application to complex regulatory diseases such as cancer.
doi:10.4137/CIN.S10630
PMCID: PMC3493142  PMID: 23170064
cancer; control; epistemology; systems biology
4.  Reducing confounding and suppression effects in TCGA data: an integrated analysis of chemotherapy response in ovarian cancer 
BMC Genomics  2012;13(Suppl 6):S13.
Background
Despite an initial response to adjuvant chemotherapy, ovarian cancer patients treated with the combination of paclitaxel and carboplatin frequently suffer from recurrence after a few cycles of treatment, and the underlying mechanisms causing the chemoresistance remain unclear. Recently, The Cancer Genome Atlas (TCGA) research network concluded an ovarian cancer study and released the dataset to the public. The TCGA dataset offers a large sample size, comprehensive molecular profiles, and clinical outcome information; however, because of the unknown molecular subtypes in ovarian cancer and the great diversity of adjuvant treatments TCGA patients went through, studying chemotherapeutic response using the TCGA data is difficult. Additionally, factors such as sample batches, patient ages, and tumor stages further confound or suppress the identification of relevant genes, and thus the biological functions and disease mechanisms.
Results
To address these issues, herein we propose an analysis procedure designed to reduce suppression effect by focusing on a specific chemotherapeutic treatment, and to remove confounding effects such as batch effect, patient's age, and tumor stages. The proposed procedure starts with a batch effect adjustment, followed by a rigorous sample selection process. Then, the gene expression, copy number, and methylation profiles from the TCGA ovarian cancer dataset are analyzed using a semi-supervised clustering method combined with a novel scoring function. As a result, two molecular classifications, one with poor copy number profiles and one with poor methylation profiles, enriched with unfavorable scores are identified. Compared with the samples enriched with favorable scores, these two classifications exhibit poor progression-free survival (PFS) and might be associated with poor chemotherapy response specifically to the combination of paclitaxel and carboplatin. Significant genes and biological processes are detected subsequently using classical statistical approaches and enrichment analysis.
Conclusions
The proposed procedure for the reduction of confounding and suppression effects and the semi-supervised clustering method are essential steps to identify genes associated with the chemotherapeutic response.
doi:10.1186/1471-2164-13-S6-S13
PMCID: PMC3481440  PMID: 23134756
5.  A systematic model of the LC-MS proteomics pipeline 
BMC Genomics  2012;13(Suppl 6):S2.
Motivation
Mass spectrometry is a complex technique used for large-scale protein profiling with clinical and pharmaceutical applications. While individual components in the system have been studied extensively, little work has been done to integrate various modules and evaluate them from a systems point of view.
Results
In this work, we investigate this problem by putting together the different modules in a typical proteomics work flow, in order to capture and analyze key factors that impact the number of identified peptides and quantified proteins, protein quantification error, differential expression results, and classification performance. The proposed proteomics pipeline model can be used to optimize the work flow as well as to pinpoint critical bottlenecks worth investing time and resources into for improving performance. Using the model-based approach proposed here, one can study systematically the critical problem of proteomic biomarker discovery, by means of simulation using ground-truthed synthetic MS data.
doi:10.1186/1471-2164-13-S6-S2
PMCID: PMC3481448  PMID: 23134670
6.  Assessing the efficacy of molecularly targeted agents on cell line-based platforms by using system identification 
BMC Genomics  2012;13(Suppl 6):S11.
Background
Molecularly targeted agents (MTAs) are increasingly used for cancer treatment, the goal being to improve the efficacy and selectivity of cancer treatment by developing agents that block the growth of cancer cells by interfering with specific targeted molecules needed for carcinogenesis and tumor growth. This approach differs from traditional cytotoxic anticancer drugs. The lack of specificity of cytotoxic drugs allows a relatively straightforward approach in preclinical and clinical studies, where the optimal dose has usually been defined as the "maximum tolerated dose" (MTD). This toxicity-based dosing approach is founded on the assumption that the therapeutic anticancer effect and toxic effects of the drug increase in parallel as the dose is escalated. On the contrary, most MTAs are expected to be more selective and less toxic than cytotoxic drugs. Consequently, the maximum therapeutic effect may be achieved at a "biologically effective dose" (BED) well below the MTD. Hence, dosing studies for MTAs should differ from those for cytotoxic drugs. Enhanced efforts to molecularly characterize the drug efficacy for MTAs in preclinical models will be valuable for successfully designing dosing regimens for clinical trials.
Results
A novel preclinical model combining experimental methods and theoretical analysis is proposed to investigate the mechanism of action and identify pharmacodynamic characteristics of the drug. Instead of fixed time point analysis of the drug exposure to drug effect, the time course of drug effect for different doses is quantitatively studied on cell line-based platforms using system identification, where tumor cells' responses to drugs through the use of fluorescent reporters are sampled over a time course. Results show that drug effect is time-varying and higher dosages induce faster and stronger responses as expected. However, the change in drug efficacy across dosages is not linear; on the contrary, there exist certain thresholds. This kind of preclinical study can provide valuable suggestions about dosing regimens for the in vivo experimental stage to increase productivity.
doi:10.1186/1471-2164-13-S6-S11
PMCID: PMC3481481  PMID: 23134733
7.  Multiple-rule bias in the comparison of classification rules 
Bioinformatics  2011;27(12):1675-1683.
Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule.
Results: This article provides a careful probabilistic analysis of the second issue and the ‘multiple-rule bias’, resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators.
Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included.
Contact: edward@ece.tamu.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr262
PMCID: PMC3106200  PMID: 21546390
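The optimism described in entry 7 can be seen in a small Monte Carlo sketch. It makes a strong simplifying assumption that is not in the paper: every rule has the same true error, and each rule's error is estimated on an independent test draw. In the paper's setting the rules are compared on a single dataset, so the estimates would be correlated. All names and parameter values are hypothetical.

```python
import random

def min_estimated_error(num_rules, true_error, test_size, rng):
    """Estimate each rule's error on its own test draw and return the smallest estimate."""
    estimates = []
    for _ in range(num_rules):
        mistakes = sum(1 for _ in range(test_size) if rng.random() < true_error)
        estimates.append(mistakes / test_size)
    return min(estimates)

def average_min_estimate(num_rules, true_error, test_size, trials, seed=0):
    """Average, over repeated experiments, of the minimum estimated error."""
    rng = random.Random(seed)
    total = sum(min_estimated_error(num_rules, true_error, test_size, rng)
                for _ in range(trials))
    return total / trials

TRUE_ERROR = 0.3
single = average_min_estimate(num_rules=1, true_error=TRUE_ERROR, test_size=30, trials=2000)
best_of_ten = average_min_estimate(num_rules=10, true_error=TRUE_ERROR, test_size=30, trials=2000)
# Negative value: reporting the best of ten rules understates the true error.
bias = best_of_ten - TRUE_ERROR
```

With one rule the average estimate stays close to the true error. Picking the minimum over ten equally good rules pulls the reported error well below it, even though no rule is actually better than the others.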
8.  On the Long-run Sensitivity of Probabilistic Boolean Networks 
Journal of theoretical biology  2008;257(4):560-577.
Boolean networks and, more generally, probabilistic Boolean networks, as one class of gene regulatory networks, model biological processes with the network dynamics determined by the logic-rule regulatory functions in conjunction with probabilistic parameters involved in network transitions. While there has been significant research on applying different control policies to alter network dynamics as future gene therapeutic intervention, we have seen less work on understanding the sensitivity of network dynamics with respect to perturbations to networks, including regulatory rules and the involved parameters, which is particularly critical for the design of intervention strategies. This paper studies this less investigated issue of network sensitivity in the long run. As the underlying model of probabilistic Boolean networks is a finite Markov chain, we define the network sensitivity based on the steady-state distributions of probabilistic Boolean networks and call it long-run sensitivity. The steady-state distribution reflects the long-run behavior of the network and it can give insight into the dynamics or momentum existing in a system. The change of steady-state distribution caused by possible perturbations is the key measure for intervention. This newly defined long-run sensitivity can provide insight on both network inference and intervention. We show the results for probabilistic Boolean networks generated from random Boolean networks, and the results from two real biological networks illustrate preliminary applications of sensitivity in intervention for practical problems.
doi:10.1016/j.jtbi.2008.12.023
PMCID: PMC2660388  PMID: 19168076
Genetic regulatory networks; Boolean networks; probabilistic Boolean networks; Markov chains; sensitivity; steady-state distribution; intervention; metastasis
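The following minimal sketch illustrates how a perturbation can shift a steady-state distribution, the quantity on which entry 8 bases long-run sensitivity. It uses a hypothetical three-state Markov chain and measures the shift with an L1 distance. The L1 distance is an illustrative choice and should not be read as the paper's definition of long-run sensitivity.

```python
def steady_state(P, iterations=2000):
    """Approximate the stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of pi <- pi * P.
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical transition matrix (each row sums to 1).
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.1, 0.8]]

# Same chain with the transitions out of state 2 perturbed.
P_perturbed = [[0.9, 0.1, 0.0],
               [0.2, 0.7, 0.1],
               [0.3, 0.1, 0.6]]

pi = steady_state(P)                      # approximately [0.625, 0.25, 0.125]
pi_perturbed = steady_state(P_perturbed)  # approximately [0.6875, 0.25, 0.0625]
shift = l1_distance(pi, pi_perturbed)     # approximately 0.125
```

Power iteration is adequate here because the toy chain is irreducible and aperiodic. For the large state spaces typical of gene networks, a sparse eigenvector solver would be a more practical choice.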
9.  A Systems Biology Approach in Therapeutic Response Study for Different Dosing Regimens—a Modeling Study of Drug Effects on Tumor Growth using Hybrid Systems 
Cancer Informatics  2012;11:41-60.
Motivated by the frustration of translation of research advances in the molecular and cellular biology of cancer into treatment, this study calls for cross-disciplinary efforts and proposes a methodology of incorporating drug pharmacology information into drug therapeutic response modeling using a computational systems biology approach. The objectives are twofold. The first one is to involve effective mathematical modeling in the drug development stage to incorporate preclinical and clinical data in order to decrease costs of drug development and increase pipeline productivity, since it is extremely expensive and difficult to get the optimal compromise of dosage and schedule through empirical testing. The second objective is to provide valuable suggestions to adjust individual drug dosing regimens to improve therapeutic effects considering most anticancer agents have wide inter-individual pharmacokinetic variability and a narrow therapeutic index. A dynamic hybrid systems model is proposed to study drug antitumor effect from the perspective of tumor growth dynamics, specifically the dosing and schedule of the periodic drug intake, and a drug’s pharmacokinetics and pharmacodynamics information are linked together in the proposed model using a state-space approach. It is proved analytically that there exists an optimal drug dosage and interval administration point, and this is demonstrated through a simulation study.
doi:10.4137/CIN.S8185
PMCID: PMC3298374  PMID: 22442626
drug effect; drug efficacy region; dosing regimens; hybrid systems; systems biology; tumor growth
10.  The Illusion of Distribution-Free Small-Sample Classification in Genomics 
Current Genomics  2011;12(5):333-341.
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
doi:10.2174/138920211796429763
PMCID: PMC3145263  PMID: 22294876
Classification; epistemology; error estimation; genomics; validation.
11.  State reduction for network intervention in probabilistic Boolean networks 
Bioinformatics  2010;26(24):3098-3104.
Motivation: A key goal of studying biological systems is to design therapeutic intervention strategies. Probabilistic Boolean networks (PBNs) constitute a mathematical model which enables modeling, predicting and intervening in their long-run behavior using Markov chain theory. The long-run dynamics of a PBN, as represented by its steady-state distribution (SSD), can guide the design of effective intervention strategies for the modeled systems. A major obstacle for its application is the large state space of the underlying Markov chain, which poses a serious computational challenge. Hence, it is critical to reduce the model complexity of PBNs for practical applications.
Results: We propose a strategy to reduce the state space of the underlying Markov chain of a PBN based on a criterion that the reduction least distorts the proportional change of stationary masses for critical states, for instance, the network attractors. In comparison to previous reduction methods, we reduce the state space directly, without deleting genes. We then derive stationary control policies on the reduced network that can be naturally induced back to the original network. Computational experiments study the effects of the reduction on model complexity and the performance of designed control policies which is measured by the shift of stationary mass away from undesirable states, those associated with undesirable phenotypes. We consider randomly generated networks as well as a 17-gene gastrointestinal cancer network, which, if not reduced, has a 2^17 × 2^17 transition probability matrix. Such a dimension is too large for direct application of many previously proposed PBN intervention strategies.
Contact: xqian@cse.usf.edu
Supplementary information: Supplementary information is available at Bioinformatics online.
doi:10.1093/bioinformatics/btq575
PMCID: PMC3025721  PMID: 20956246
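A short calculation shows why the state space in entry 11 is a computational obstacle. A network of 17 binary genes has 2^17 states, so its transition probability matrix has 2^17 rows and 2^17 columns. The sketch below assumes dense storage with 8 bytes per entry. That storage model is an assumption for illustration and is not taken from the paper.

```python
def transition_matrix_cost(num_genes, bytes_per_entry=8):
    """Return (states, matrix entries, bytes) for a dense transition matrix over binary genes."""
    states = 2 ** num_genes
    entries = states * states
    return states, entries, entries * bytes_per_entry

states, entries, num_bytes = transition_matrix_cost(17)
gib = num_bytes / 2 ** 30

print(states)   # 131072
print(entries)  # 17179869184
print(gib)      # 128.0
```

Under these assumptions the dense matrix alone would need 128 GiB, which is consistent with the abstract's point that many intervention strategies cannot be applied directly without reducing the state space.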
12.  A CoD-based stationary control policy for intervening in large gene regulatory networks 
BMC Bioinformatics  2011;12(Suppl 10):S10.
Background
One of the most important goals of the mathematical modeling of gene regulatory networks is to alter their behavior toward desirable phenotypes. Therapeutic techniques are derived for intervention in terms of stationary control policies. In large networks, it becomes computationally burdensome to derive an optimal control policy. To overcome this problem, greedy intervention approaches based on the concept of the Mean First Passage Time or the steady-state probability mass of the network states were previously proposed. Another possible approach is to use reduction mappings to compress the network and develop control policies on its reduced version. However, such mappings lead to loss of information and require an induction step when designing the control policy for the original network.
Results
In this paper, we propose a novel solution, CoD-CP, for designing intervention policies for large Boolean networks. The new method utilizes the Coefficient of Determination (CoD) and the Steady-State Distribution (SSD) of the model. The main advantage of CoD-CP in comparison with the previously proposed methods is that it does not require any compression of the original model, and thus can be directly designed on large networks. The simulation studies on small synthetic networks show that CoD-CP performs comparably to previously proposed greedy policies that were induced from the compressed versions of the networks. Furthermore, on a large 17-gene gastrointestinal cancer network, CoD-CP outperforms the other two available greedy techniques, which is precisely the kind of case for which CoD-CP has been developed. Finally, our experiments show that CoD-CP is robust with respect to the attractor structure of the model.
Conclusions
The newly proposed CoD-CP provides an attractive alternative for intervening in large networks where other available greedy methods require size reduction on the network and an extra induction step before designing a control policy.
doi:10.1186/1471-2105-12-S10-S10
PMCID: PMC3236832  PMID: 22165980
13.  Probabilistic reconstruction of the tumor progression process in gene regulatory networks in the presence of uncertainty 
BMC Bioinformatics  2011;12(Suppl 10):S9.
Background
Accumulation of gene mutations in cells is known to be responsible for tumor progression, driving it from benign states to malignant states. However, previous studies have shown that the detailed sequence of gene mutations, or the steps in tumor progression, may vary from tumor to tumor, making it difficult to infer the exact path that a given type of tumor may have taken.
Results
In this paper, we propose an effective probabilistic algorithm for reconstructing the tumor progression process based on partial knowledge of the underlying gene regulatory network and the steady state distribution of the gene expression values in a given tumor. We take the BNp (Boolean networks with perturbation) framework to model the gene regulatory networks. We assume that the true network is not exactly known but we are given an uncertainty class of networks that contains the true network. This network uncertainty class arises from our partial knowledge of the true network, typically represented as a set of local pathways that are embedded in the global network. Given the SSD of the cancerous network, we aim to simultaneously identify the true normal (healthy) network and the set of gene mutations that drove the network into the cancerous state. This is achieved by analyzing the effect of gene mutation on the SSD of a gene regulatory network. At each step, the proposed algorithm reduces the uncertainty class by keeping only those networks whose SSDs get close enough to the cancerous SSD as a result of additional gene mutation. These steps are repeated until we can find the best candidate for the true network and the most probable path of tumor progression.
Conclusions
Simulation results based on both synthetic networks and networks constructed from actual pathway knowledge show that the proposed algorithm can identify the normal network and the actual path of tumor progression with high probability. The algorithm is also robust to model mismatch and allows us to control the trade-off between efficiency and accuracy.
doi:10.1186/1471-2105-12-S10-S9
PMCID: PMC3236852  PMID: 22166046
14.  Causality, Randomness, Intelligibility, and the Epistemology of the Cell 
Current Genomics  2010;11(4):221-237.
Because the basic unit of biology is the cell, biological knowledge is rooted in the epistemology of the cell, and because life is the salient characteristic of the cell, its epistemology must be centered on its livingness, not its constituent components. The organization and regulation of these components in the pursuit of life constitute the fundamental nature of the cell. Thus, regulation sits at the heart of biological knowledge of the cell and the extraordinary complexity of this regulation conditions the kind of knowledge that can be obtained, in particular, the representation and intelligibility of that knowledge. This paper is essentially split into two parts. The first part discusses the inadequacy of everyday intelligibility and intuition in science and the consequent need for scientific theories to be expressed mathematically without appeal to commonsense categories of understanding, such as causality. Having set the backdrop, the second part addresses biological knowledge. It briefly reviews modern scientific epistemology from a general perspective and then turns to the epistemology of the cell. In analogy with a multi-faceted factory, the cell utilizes a highly parallel distributed control system to maintain its organization and regulate its dynamical operation in the face of both internal and external changes. Hence, scientific knowledge is constituted by the mathematics of stochastic dynamical systems, which model the overall relational structure of the cell and how these structures evolve over time, stochasticity being a consequence of the need to ignore a large number of factors while modeling relatively few in an extremely complex environment.
doi:10.2174/138920210791233072
PMCID: PMC2930662  PMID: 21119887
Biology; causality; computational biology; epistemology; genomics; systems biology.
15.  Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network 
BMC Bioinformatics  2010;11(Suppl 6):S8.
Background
Finding reliable gene markers for accurate disease classification is very challenging due to a number of reasons, including the small sample size of typical clinical data, high noise in gene expression measurements, and the heterogeneity across patients. In fact, gene markers identified in independent studies often do not coincide with each other, suggesting that many of the predicted markers may have no biological significance and may be simply artifacts of the analyzed dataset. To find more reliable and reproducible diagnostic markers, several studies proposed to analyze the gene expression data at the level of groups of functionally related genes, such as pathways. Studies have shown that pathway markers tend to be more robust and yield more accurate classification results. One practical problem of the pathway-based approach is the limited coverage of genes by currently known pathways. As a result, potentially important genes that play critical roles in cancer development may be excluded. To overcome this problem, we propose a novel method for identifying reliable subnetwork markers in a human protein-protein interaction (PPI) network.
Results
In this method, we overlay the gene expression data with the PPI network and look for the most discriminative linear paths that consist of discriminative genes that are highly correlated to each other. The overlapping linear paths are then optimally combined into subnetworks that can potentially serve as effective diagnostic markers. We tested our method on two independent large-scale breast cancer datasets and compared the effectiveness and reproducibility of the identified subnetwork markers with gene-based and pathway-based markers. We also compared the proposed method with an existing subnetwork-based method.
Conclusions
The proposed method can efficiently find reliable subnetwork markers that outperform the gene-based and pathway-based markers in terms of discriminative power, reproducibility and classification performance. Subnetwork markers found by our method are highly enriched in common GO terms, and they can more accurately classify breast cancer metastasis compared to markers found by a previous method.
doi:10.1186/1471-2105-11-S6-S8
PMCID: PMC3026382  PMID: 20946619
16.  BPDA - A Bayesian peptide detection algorithm for mass spectrometry 
BMC Bioinformatics  2010;11:490.
Background
Mass spectrometry (MS) is an essential analytical tool in proteomics. Many existing algorithms for peptide detection are based on isotope template matching and usually work at different charge states separately, making them ineffective at detecting overlapping peptides and low abundance peptides.
Results
We present BPDA, a Bayesian approach for peptide detection in data produced by MS instruments with high enough resolution to baseline-resolve isotopic peaks, such as MALDI-TOF and LC-MS. We model the spectra as a mixture of candidate peptide signals, and the model is parameterized by MS physical properties. BPDA is based on a rigorous statistical framework and avoids problems, such as voting and ad-hoc thresholding, generally encountered in algorithms based on template matching. It systematically evaluates all possible combinations of possible peptide candidates to interpret a given spectrum, and iteratively finds the best fitting peptide signal in order to minimize the mean squared error of the inferred spectrum to the observed spectrum. In contrast to previous detection methods, BPDA performs deisotoping and deconvolution of mass spectra simultaneously, which enables better identification of weak peptide signals and produces higher sensitivities and more robust results. Unlike template-matching algorithms, BPDA can handle complex data where features overlap. Our experimental results indicate that BPDA performs well on simulated data and real MS data sets, for various resolutions and signal to noise ratios, and compares very favorably with commonly used commercial and open-source software, such as flexAnalysis, OpenMS, and Decon2LS, according to sensitivity and detection accuracy.
Conclusion
Unlike previous detection methods, which only employ isotopic distributions and work at each single charge state alone, BPDA takes into account the charge state distribution as well, thus lending information to better identify weak peptide signals and produce more robust results. The proposed approach is based on a rigorous statistical framework, which avoids problems generally encountered in algorithms based on template matching. Our experiments indicate that BPDA performs well on both simulated data and real data, and compares very favorably with commonly used commercial and open-source software. The BPDA software can be downloaded from http://gsp.tamu.edu/Publications/supplementary/sun10a/bpda.
doi:10.1186/1471-2105-11-490
PMCID: PMC3098078  PMID: 20920238
17.  Noninvasive Detection of Candidate Molecular Biomarkers in Subjects with a History of Insulin Resistance and Colorectal Adenomas 
We have developed novel molecular methods using a stool sample, which contains intact sloughed colon cells, to quantify colonic gene expression profiles. In this study, our goal was to identify diagnostic gene sets (combinations) for the noninvasive classification of different phenotypes. For this purpose, the effects of a legume-enriched, low glycemic index, high fermentable fiber diet were evaluated in subjects with four possible combinations of risk factors, including insulin resistance and a history of adenomatous polyps. In a randomized crossover design controlled feeding study, each participant (a total of 23; 5–12 per group) consumed the experimental diet (1.5 cups of cooked dry beans) and a control diet (isocaloric average American diet) for 4 weeks with a 3-week washout period between diets. Using prior biological knowledge, the complexity of feature selection was reduced to perform an exhaustive search on all allowable feature (gene) sets of size 3, and among these, 27 had (unbiased) error estimates of 0.15 or less. Linear discriminant analysis was successfully used to identify the best single genes and two- to three-gene combinations for distinguishing subjects with insulin resistance, a history of polyps, or exposure to a chemoprotective legume-rich diet. These results support our premise that gene products (RNA) isolated from stool have diagnostic value in terms of assessing colon cancer risk.
doi:10.1158/1940-6207.CAPR-08-0233
PMCID: PMC2745241  PMID: 19470793
18.  Recent Advances in Intervention in Markovian Regulatory Networks 
Current Genomics  2009;10(7):463-477.
Markovian regulatory networks constitute a class of discrete state-space models used to study gene regulatory dynamics and discover methods that beneficially alter those dynamics. Thereby, this class of models provides a framework to discover effective drug targets and design potent therapeutic strategies. The salient translational goal is to design therapeutic strategies that desirably modify network dynamics via external signals that vary the expressions of a control gene. The objective of an intervention strategy is to reduce the likelihood of the pathological cellular function related to a disease. The task of finding an effective intervention strategy can be formulated as a sequential decision making problem for a pre-defined cost of intervention and a cost-per-stage function that discriminates the gene-activity profiles. An effective intervention strategy prescribes the actions associated with an external signal that result in the minimum expected cost. This strategy in turn can be used as a treatment that reduces the long-run likelihood of gene expressions favorable to the disease. In this tutorial, we briefly summarize the first method proposed to design such therapeutic interventions, and then move on to some of the recent refinements that have been proposed. Each of these recent intervention methods is motivated by practical or analytical considerations. The presentation of the key ideas is facilitated with the help of two case studies.
doi:10.2174/138920209789208246
PMCID: PMC2808674  PMID: 20436874
Regulatory networks; markovian decision processes; translational genomics; systems biology.
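The sequential decision-making formulation summarized in this abstract can be illustrated with a small worked sketch. The code below is not taken from the tutorial: it assumes a discounted infinite-horizon criterion, a hypothetical two-state chain (state 0 desirable, state 1 undesirable), and made-up transition probabilities and costs. The function name `value_iteration` and the discount factor `gamma` are illustrative choices.

```python
def value_iteration(P, cost, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Minimize expected discounted cost on a finite Markov decision process.

    P[a][s][t] -- probability of moving from state s to state t under action a
    cost[a][s] -- cost per stage of taking action a in state s
                  (assumed here to already include any cost of intervention)
    Returns (values, policy), where policy[s] is the minimizing action in s.
    """
    n_actions = len(P)
    n_states = len(P[0])
    v = [0.0] * n_states
    for _ in range(max_iter):
        # Expected cost of each action in each state, given current values.
        q = [
            [
                cost[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        new_v = [min(row) for row in q]
        converged = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if converged:
            break
    policy = [min(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return v, policy


# Hypothetical toy network: action 0 = no intervention, action 1 = intervene.
P_TOY = [
    [[0.9, 0.1], [0.2, 0.8]],    # dynamics without intervention
    [[0.95, 0.05], [0.7, 0.3]],  # dynamics with intervention
]
# State 1 costs 5 per stage; intervening adds 1 to the stage cost.
COST_TOY = [[0.0, 5.0], [1.0, 6.0]]

if __name__ == "__main__":
    values, policy = value_iteration(P_TOY, COST_TOY)
    print("values:", values)
    print("policy:", policy)
```

With these toy numbers the resulting stationary policy intervenes only in the undesirable state, because the cost of intervening in the desirable state outweighs the small reduction in the chance of leaving it.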
19.  Characterization of the Effectiveness of Reporting Lists of Small Feature Sets Relative to the Accuracy of the Prior Biological Knowledge 
Cancer Informatics  2010;9:49-60.
When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets.
Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
PMCID: PMC2865771  PMID: 20458361
classification; feature ranking; ranking power
20.  Performance of Feature Selection Methods 
Current Genomics  2009;10(6):365-374.
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.
doi:10.2174/138920209789177629
PMCID: PMC2766788  PMID: 20190952
21.  Genomic Signal Processing 
Current Genomics  2009;10(6):364.
doi:10.2174/138920209789177593
PMCID: PMC2766787  PMID: 20190951
22.  Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity 
PLoS ONE  2009;4(12):e8161.
With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods.
doi:10.1371/journal.pone.0008161
PMCID: PMC2781165  PMID: 19997592
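The inference step described in this abstract, a per-gene log-likelihood ratio combined into a pathway activity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the class-conditional densities or the combination rule, so the sketch assumes Gaussian densities per gene and a plain sum over member genes. The gene names, parameter values, and function names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def pathway_activity(expression, params):
    """Sum of per-gene log-likelihood ratios between two phenotypes.

    expression -- dict mapping gene name to its expression level in one sample
    params     -- dict mapping gene name to ((mean1, std1), (mean2, std2)),
                  the assumed Gaussian parameters under phenotype 1 and 2
    A positive result favors phenotype 1; a negative one favors phenotype 2.
    """
    total = 0.0
    for gene, x in expression.items():
        (m1, s1), (m2, s2) = params[gene]
        total += gaussian_logpdf(x, m1, s1) - gaussian_logpdf(x, m2, s2)
    return total


# Hypothetical two-gene pathway with unit-variance class-conditional densities.
PARAMS_TOY = {
    "geneA": ((2.0, 1.0), (0.0, 1.0)),
    "geneB": ((1.0, 1.0), (-1.0, 1.0)),
}

if __name__ == "__main__":
    sample = {"geneA": 2.0, "geneB": 1.0}
    print(pathway_activity(sample, PARAMS_TOY))
```

In practice the per-gene parameters would be estimated from training samples of each phenotype, and the resulting activities would serve as features for a downstream classifier.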
23.  Analysis and modeling of time-course gene-expression profiles from nanomaterial-exposed primary human epidermal keratinocytes 
BMC Bioinformatics  2009;10(Suppl 11):S10.
Background
Nanomaterials are being manufactured on a commercial scale for use in medical, diagnostic, energy, component and communications industries. However, concerns over the safety of engineered nanomaterials have surfaced. Humans can be exposed to nanomaterials in different ways such as inhalation or exposure through the integumentary system.
Results
The interactions of engineered nanomaterials with primary human cells were investigated using a systems biology approach combining gene expression microarray profiling with dynamic experimental parameters. In this experiment, primary human epidermal keratinocytes were exposed to several low-micron to nano-scale materials, and gene expression was profiled over both time and dose to compile a comprehensive picture of nanomaterial-cellular interactions. Very few gene-expression studies so far have dealt with both time and dose response simultaneously. Here, we propose different approaches to this kind of analysis. First, we used heat maps and multi-dimensional scaling (MDS) plots to visualize the dose response of nanomaterials over time. Then, in order to find the most common patterns in gene-expression profiles, we used self-organizing maps (SOM) combined with two different criteria to determine the number of clusters. The consistency of SOM results is discussed in the context of the information derived from the MDS plots. Finally, in order to identify the genes that have significantly different responses among different dose levels of each treatment while simultaneously accounting for the effect of time, we used a two-way ANOVA model, in connection with Tukey's additivity test and the Box-Cox transformation. The results are discussed in the context of the cellular responses to engineered nanomaterials.
Conclusion
The analysis presented here led to interesting and complementary conclusions about the response across time of human epidermal keratinocytes after exposure to nanomaterials. For example, we observed that gene expression for most treatments becomes closer to the expression of the baseline cultures as time proceeds. The genes found to be differentially expressed are involved in a number of cellular processes, including regulation of transcription and translation, protein localization, transport, cell cycle progression, cell migration, cytoskeletal reorganization, signal transduction, and development.
doi:10.1186/1471-2105-10-S11-S10
PMCID: PMC3226183  PMID: 19811675
24.  Translational Science: Epistemology and the Investigative Process 
Current Genomics  2009;10(2):102-109.
The term “translational science” has recently become very popular with its usage appearing to be almost exclusively related to medicine, in particular, the “translation” of biological knowledge into medical practice. Taking the perspective that translational science is somehow different than science and that sound science is grounded in an epistemology developed over millennia, it seems imperative that the meaning of translational science be carefully examined, especially how the scientific epistemology manifests itself in translational science. This paper examines epistemological issues relating mainly to modeling in translational science, with a focus on optimal operator synthesis. It goes on to discuss the implications of epistemology on the nature of collaborations conducive to the translational investigative process. The philosophical concepts are illustrated by considering intervention in gene regulatory networks.
doi:10.2174/138920209787847005
PMCID: PMC2699826  PMID: 19794882
25.  Intervention in gene regulatory networks via greedy control policies based on long-run behavior 
BMC Systems Biology  2009;3:61.
Background
A salient purpose for studying gene regulatory networks is to derive intervention strategies, the goals being to identify potential drug targets and design gene-based therapeutic intervention. Optimal stochastic control based on the transition probability matrix of the underlying Markov chain has been studied extensively for probabilistic Boolean networks. Optimization is based on minimization of a cost function and a key goal of control is to reduce the steady-state probability mass of undesirable network states. Owing to computational complexity, it is difficult to apply optimal control for large networks.
Results
In this paper, we propose three new greedy stationary control policies by directly investigating the effects on the network long-run behavior. Similar to the recently proposed mean-first-passage-time (MFPT) control policy, these policies do not depend on minimization of a cost function and avoid the computational burden of dynamic programming. They can be used to design stationary control policies that avoid the need for a user-defined cost function because they are based directly on long-run network behavior; they can be used as an alternative to dynamic programming algorithms when the latter are computationally prohibitive; and they can be used to predict the best control gene with reduced computational complexity, even when one is employing dynamic programming to derive the final control policy. We compare the performance of these three greedy control policies and the MFPT policy using randomly generated probabilistic Boolean networks and give a preliminary example for intervening in a mammalian cell cycle network.
Conclusion
The newly proposed control policies have better performance in general than the MFPT policy and, as indicated by the results on the mammalian cell cycle network, they can potentially serve as future gene therapeutic intervention strategies.
doi:10.1186/1752-0509-3-61
PMCID: PMC2728102  PMID: 19527511
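The quantity that these control policies aim to reduce, the steady-state probability mass of undesirable network states, can be computed for a small Markov chain as sketched below. This does not reproduce any of the three greedy policies or the MFPT policy from the paper. It only compares the long-run undesirable mass of a hypothetical two-state chain with and without a hypothetical intervention, assuming the chain is ergodic so that power iteration converges. All names and numbers are illustrative.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by power iteration, starting from the uniform distribution.
    Assumes the chain is ergodic (irreducible and aperiodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new_pi = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


def undesirable_mass(P, undesirable):
    """Total stationary probability of the states listed in `undesirable`."""
    pi = stationary_distribution(P)
    return sum(pi[s] for s in undesirable)


# Hypothetical chain: state 0 is desirable, state 1 is undesirable.
P_UNCONTROLLED = [[0.9, 0.1], [0.2, 0.8]]
# Same chain when a control action is applied whenever the chain is in state 1.
P_CONTROLLED = [[0.9, 0.1], [0.7, 0.3]]

if __name__ == "__main__":
    print("uncontrolled:", undesirable_mass(P_UNCONTROLLED, [1]))
    print("controlled:  ", undesirable_mass(P_CONTROLLED, [1]))
```

A greedy scheme in this spirit would evaluate such a long-run quantity for each candidate action and keep the one that lowers it most, without solving a dynamic program.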
