Results 1-9 (9)
1.  Evaluating Gene Set Enrichment Analysis Via a Hybrid Data Model 
Cancer Informatics  2014;13(Suppl 1):1-16.
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set, and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that, for the proposed data model, the Q2-type GSA methods generally perform better than the other GSA methods, and the global test gives the most robust results. The properties of a data set play a critical role in the performance: for data sets with highly connected genes, the performance of all GSA methods suffers significantly.
PMCID: PMC3929260  PMID: 24558298
gene set enrichment analysis; feature ranking; data model; simulation study
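As a rough illustration of the hybrid data model described in this abstract, the sketch below picks a master gene to define ground-truth phenotype labels and resamples the columns of a large expression matrix to produce a batch of small data sets. All dimensions, the labeling rule, and the sampling scheme are hypothetical stand-ins, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 200 genes x 60 samples
# (a stand-in for a real large data set).
X = rng.normal(size=(200, 60))

def hybrid_data_model(X, master_gene, n_samples, rng):
    """Build one small synthetic data set from a large one.

    The master gene's expression defines the ground-truth
    phenotype: samples above its median are labeled 1, the
    rest 0. A smaller data set is then drawn by resampling
    the original samples.
    """
    labels = (X[master_gene] > np.median(X[master_gene])).astype(int)
    idx = rng.choice(X.shape[1], size=n_samples, replace=True)
    return X[:, idx], labels[idx]

# Generate a batch of 100 small data sets from one master gene;
# each can then be fed to a GSA method to study its ranking.
batch = [hybrid_data_model(X, master_gene=5, n_samples=20, rng=rng)
         for _ in range(100)]
```

Because many such batches can be drawn from one large data set, the ranking behavior of a GSA method can be checked over a large number of small-sample replicates.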
2.  Newton, Laplace, and The Epistemology of Systems Biology 
Cancer Informatics  2012;11:185-190.
For science, theoretical or applied, to significantly advance, researchers must use the most appropriate mathematical methods. A century and a half elapsed between Newton’s development of the calculus and Laplace’s development of celestial mechanics. One cannot imagine the latter without the former. Today, more than three-quarters of a century has elapsed since the birth of stochastic systems theory. This article provides a perspective on the utilization of systems theory as the proper vehicle for the development of systems biology and its application to complex regulatory diseases such as cancer.
PMCID: PMC3493142  PMID: 23170064
cancer; control; epistemology; systems biology
3.  Assessing the efficacy of molecularly targeted agents on cell line-based platforms by using system identification 
BMC Genomics  2012;13(Suppl 6):S11.
Molecularly targeted agents (MTAs) are increasingly used for cancer treatment, the goal being to improve the efficacy and selectivity of cancer treatment by developing agents that block the growth of cancer cells by interfering with specific targeted molecules needed for carcinogenesis and tumor growth. This approach differs from traditional cytotoxic anticancer drugs. The lack of specificity of cytotoxic drugs allows a relatively straightforward approach in preclinical and clinical studies, where the optimal dose has usually been defined as the "maximum tolerated dose" (MTD). This toxicity-based dosing approach is founded on the assumption that the therapeutic anticancer effect and toxic effects of the drug increase in parallel as the dose is escalated. On the contrary, most MTAs are expected to be more selective and less toxic than cytotoxic drugs. Consequently, the maximum therapeutic effect may be achieved at a "biologically effective dose" (BED) well below the MTD. Hence, dosing studies for MTAs should differ from those for cytotoxic drugs. Enhanced efforts to molecularly characterize drug efficacy for MTAs in preclinical models will be valuable for successfully designing dosing regimens for clinical trials.
A novel preclinical model combining experimental methods and theoretical analysis is proposed to investigate the mechanism of action and identify pharmacodynamic characteristics of the drug. Instead of fixed-time-point analysis relating drug exposure to drug effect, the time course of drug effect at different doses is quantitatively studied on cell line-based platforms using system identification, in which tumor cells’ responses to drugs, measured via fluorescent reporters, are sampled over a time course. Results show that drug effect is time-varying and that higher dosages induce faster and stronger responses, as expected. However, the change in drug efficacy across dosages is not linear; rather, certain thresholds exist. This kind of preclinical study can provide valuable suggestions about dosing regimens for the in vivo experimental stage to increase productivity.
PMCID: PMC3481481  PMID: 23134733
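The time-course dose-response idea above can be caricatured with a toy system-identification example: fitting a simple first-order response model to simulated reporter time courses at two hypothetical doses. The model form, dose levels, and noise are illustrative assumptions, not the paper's identified system:

```python
import numpy as np
from scipy.optimize import curve_fit

def response(t, emax, k):
    # First-order response: effect rises toward a plateau emax
    # at rate k (a simple stand-in for the identified dynamics).
    return emax * (1.0 - np.exp(-k * t))

t = np.linspace(0, 48, 25)   # hours sampled over the time course
rng = np.random.default_rng(1)

# Simulated reporter readouts for two hypothetical doses: the
# higher dose responds faster (larger k) and more strongly.
low = response(t, 0.4, 0.08) + rng.normal(0, 0.01, t.size)
high = response(t, 0.9, 0.25) + rng.normal(0, 0.01, t.size)

for name, y in [("low dose", low), ("high dose", high)]:
    (emax, k), _ = curve_fit(response, t, y, p0=(0.5, 0.1))
    print(f"{name}: emax={emax:.2f}, rate={k:.2f}")
```

Fitting the same model across a dose ladder would expose the kind of nonlinear, threshold-like efficacy changes the abstract reports.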
4.  Causality, Randomness, Intelligibility, and the Epistemology of the Cell 
Current Genomics  2010;11(4):221-237.
Because the basic unit of biology is the cell, biological knowledge is rooted in the epistemology of the cell, and because life is the salient characteristic of the cell, its epistemology must be centered on its livingness, not its constituent components. The organization and regulation of these components in the pursuit of life constitute the fundamental nature of the cell. Thus, regulation sits at the heart of biological knowledge of the cell and the extraordinary complexity of this regulation conditions the kind of knowledge that can be obtained, in particular, the representation and intelligibility of that knowledge. This paper is essentially split into two parts. The first part discusses the inadequacy of everyday intelligibility and intuition in science and the consequent need for scientific theories to be expressed mathematically without appeal to commonsense categories of understanding, such as causality. Having set the backdrop, the second part addresses biological knowledge. It briefly reviews modern scientific epistemology from a general perspective and then turns to the epistemology of the cell. In analogy with a multi-faceted factory, the cell utilizes a highly parallel distributed control system to maintain its organization and regulate its dynamical operation in the face of both internal and external changes. Hence, scientific knowledge is constituted by the mathematics of stochastic dynamical systems, which model the overall relational structure of the cell and how these structures evolve over time, stochasticity being a consequence of the need to ignore a large number of factors while modeling relatively few in an extremely complex environment.
PMCID: PMC2930662  PMID: 21119887
Biology; causality; computational biology; epistemology; genomics; systems biology.
5.  Characterization of the Effectiveness of Reporting Lists of Small Feature Sets Relative to the Accuracy of the Prior Biological Knowledge 
Cancer Informatics  2010;9:49-60.
When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best-performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased, so there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed: the first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small—that is, the prior biological knowledge is not too poor—then one should expect, with high probability, to find good feature sets.
Availability: companion website at
PMCID: PMC2865771  PMID: 20458361
classification; feature ranking; ranking power
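The two power measures defined in this abstract can be approximated by a small Monte Carlo sketch. The error distribution, estimator noise model, and parameter values below are illustrative assumptions chosen only to show the mechanics, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(2)

def list_power(n_sets=500, list_len=10, tol=0.05,
               est_sd=0.04, n_trials=2000, rng=rng):
    """Monte Carlo estimate of the two power measures.

    True classification errors of candidate feature sets are
    drawn at random; noisy training-data-based estimates rank
    them; the top `list_len` sets are checked against the best.
    Returns (P[at least one good set on list],
             E[number of good sets on list]).
    """
    hits = 0
    expected = 0.0
    for _ in range(n_trials):
        true_err = rng.uniform(0.05, 0.5, n_sets)
        est_err = true_err + rng.normal(0, est_sd, n_sets)
        top = np.argsort(est_err)[:list_len]
        good = true_err[top] <= true_err.min() + tol
        hits += good.any()
        expected += good.sum()
    return hits / n_trials, expected / n_trials

p_at_least_one, expected_good = list_power()
```

Sweeping `list_len` in such a simulation produces power curves of the kind described in the abstract.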
6.  Optimization of oligonucleotide microarray fabricated by spotting 65-mer 
Analytical biochemistry  2007;368(1):61-69.
DNA microarrays currently provide measurements of sufficiently high quality to allow a wide variety of sound inferences about gene regulation and the coordination of cellular processes to be drawn. Nonetheless, a desire for greater precision in the measurements continues to drive the microarray research community to seek higher measurement quality through improvements in array fabrication and sample labeling and hybridization. We prepared oligonucleotide microarrays by printing 65-mers on aldehyde functional group-derivatized slides as described in the previous study. We improved the reliability of the data by removing enzymatic bias during probe labeling and by hybridizing under more stringent conditions. This optimized method was used to profile gene expression patterns for nine different mouse tissues and organs, and MDS analysis of the data showed both strong similarity between like samples and a clear, highly reproducible separation between different tissue samples. Three other microarrays were fabricated on commercial substrates and hybridized following the manufacturer’s instructions. The data were then compared with in-house microarray data and RT-PCR data. The microarray printed on the custom aldehyde slide was superior to microarrays printed on commercially available substrate slides in terms of signal intensity, background, and hybridization characteristics. The data from the custom substrate microarray generally showed good agreement with RT-PCR data in quantitative changes of transcript abundance of up to 100-fold. However, more accurate comparisons will become possible as more genomic sequence information is gathered in the public domain.
PMCID: PMC2697255  PMID: 17618862
7.  Analytical biochemistry  2007;368(1):70-78.
Microarray fabrication using pre-synthesized long oligonucleotides is becoming increasingly important, but no study of large-scale array production has yet been published. We addressed the issue of fabricating oligonucleotide microarrays by spotting commercial, pre-synthesized 65-mers with 5′ amines representing 7500 murine genes. Amine-modified oligonucleotides were immobilized on glass slides bearing aldehyde groups via transient Schiff base formation followed by reduction to produce a covalent conjugate. When RNA derived from the same source was used for Cy3 and Cy5 labeling and hybridized to the same array, signal intensities spanning three orders of magnitude were observed, and the coefficient of variation between the two channels for all spots was 8–10%. To ascertain the reproducibility of ratio determination on these arrays, two triplicate hybridizations (with fluorochrome reversal) comparing RNAs from a fibroblast (NIH3T3) and a breast cancer (JC) cell line were carried out. The 95% confidence interval for all spots in the six hybridizations was 0.60–1.66. This level of reproducibility allows use of the full range of pattern-finding and discriminant analysis typically applied to cDNA microarrays. Further comparative testing was carried out with oligonucleotide microarrays, cDNA microarrays, and RT-PCR assays to examine the comparability of results across these different methodologies.
PMCID: PMC2697254  PMID: 17617369
8.  Validation of Computational Methods in Genomics 
Current Genomics  2007;8(1):1-19.
High-throughput technologies for genomics provide tens of thousands of genetic measurements, for instance, gene-expression measurements on microarrays, and the availability of these measurements has motivated the use of machine learning (inference) methods for classification, clustering, and gene networks. Generally, a design method will yield a model that satisfies some model constraints and fits the data in some manner. On the other hand, a scientific theory consists of two parts: (1) a mathematical model to characterize relations between variables, and (2) a set of relations between model variables and observables that are used to validate the model via predictive experiments. Although machine learning algorithms are constructed to hopefully produce valid scientific models, they do not ipso facto do so. In some cases, such as classifier estimation, there is a well-developed error theory that relates to model validity according to various statistical theorems, but in others, such as clustering, there is a lack of understanding of the relationship between the learning algorithms and validation. The issue of validation is especially problematic in situations where the sample size is small in comparison with the dimensionality (number of variables), which is commonplace in genomics, because the convergence theory of learning algorithms is typically asymptotic and the algorithms often perform in counter-intuitive ways when used with samples that are small in relation to the number of variables. For translational genomics, validation is perhaps the most critical issue, because it is imperative that we understand the performance of a diagnostic or therapeutic procedure to be used in the clinic, and this performance relates directly to the validity of the model behind the procedure. This paper treats the validation issue as it appears in two classes of inference algorithms relating to genomics – classification and clustering. It formulates the problem and reviews salient results.
PMCID: PMC2474684  PMID: 18645624
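The small-sample validation problem described in this abstract, the optimistic bias of training-data-based error estimators, can be demonstrated with a minimal sketch comparing resubstitution error to true error for a nearest-mean classifier. The classifier, the Gaussian class model, and all parameters are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)

def resubstitution_bias(n=20, d=2, n_trials=500, rng=rng):
    """Average resubstitution (training-set) error vs. true error
    of a nearest-mean classifier over repeated small samples."""
    resub, true = [], []
    for _ in range(n_trials):
        # Two Gaussian classes with a modest mean shift.
        X0 = rng.normal(0.0, 1.0, (n // 2, d))
        X1 = rng.normal(1.0, 1.0, (n // 2, d))
        m0, m1 = X0.mean(0), X1.mean(0)

        def err(A, B):
            # Fraction of class-0 (A) and class-1 (B) points
            # misclassified by the nearest-mean rule.
            ok0 = np.linalg.norm(A - m0, axis=1) < np.linalg.norm(A - m1, axis=1)
            ok1 = np.linalg.norm(B - m1, axis=1) < np.linalg.norm(B - m0, axis=1)
            return 1.0 - (ok0.mean() + ok1.mean()) / 2.0

        resub.append(err(X0, X1))          # error on the training data
        T0 = rng.normal(0.0, 1.0, (2000, d))
        T1 = rng.normal(1.0, 1.0, (2000, d))
        true.append(err(T0, T1))           # error on a large test set

    return np.mean(resub), np.mean(true)

resub_mean, true_mean = resubstitution_bias()
```

On average the training-set error comes out below the true error, which is exactly the optimistic bias that makes small-sample model validation difficult.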
9.  Normalization Benefits Microarray-Based Classification 
When using cDNA microarrays, normalization to correct labeling bias is a common preliminary step before further data analysis is applied, its objective being to reduce the variation between arrays. To date, assessment of the effectiveness of normalization has mainly been confined to the ability to detect differentially expressed genes. Since a major use of microarrays is expression-based phenotype classification, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systemic-error process to generate synthetic gene-expression values with known ground truth. These synthetic expression values are subjected to typical normalization methods and passed through a set of classification rules, the objective being to carry out a systematic study of the effect of normalization on classification. Three normalization methods are considered: offset, linear regression, and Lowess regression. Seven classification rules are considered: 3-nearest neighbor, linear support vector machine, linear discriminant analysis, regular histogram, Gaussian kernel, perceptron, and multiple perceptrons with majority voting. The results for the first three are presented in the paper, with the full results given on a companion website. The conclusion from the different experimental models considered in the study is that normalization can significantly benefit classification under difficult experimental conditions, with linear and Lowess regression slightly outperforming the offset method.
PMCID: PMC3171318  PMID: 18427588
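Two of the normalization methods studied in this entry, offset and linear regression, can be sketched on synthetic two-channel data. The additive-plus-intensity-dependent bias below is a hypothetical stand-in for the paper's error model:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic two-channel log-intensities with a hypothetical
# labeling bias on the Cy5 channel (constant offset plus an
# intensity-dependent term).
n_genes = 1000
true_log = rng.normal(8.0, 1.0, n_genes)
cy3 = true_log + rng.normal(0, 0.1, n_genes)
cy5 = true_log + 0.5 + 0.05 * true_log + rng.normal(0, 0.1, n_genes)

M = cy5 - cy3            # log-ratio (biased away from zero)
A = (cy5 + cy3) / 2.0    # average log-intensity

# Offset normalization: subtract the median log-ratio.
M_offset = M - np.median(M)

# Linear-regression normalization: fit M = a + b*A by least
# squares and remove the intensity-dependent trend.
b, a = np.polyfit(A, M, 1)
M_linreg = M - (a + b * A)
```

Lowess regression follows the same pattern but removes a locally weighted, nonlinear trend in M as a function of A instead of a straight line.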