
Results 1-10 (10)

author:("Yuan, xigu")
1.  Random Subspace Aggregation for Cancer Prediction with Gene Expression Profiles 
BioMed Research International  2016;2016:4596326.
Background. Precise cancer prediction is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on a genome-wide scale. Gene expression data analysis, however, faces enormous challenges owing to characteristics of the data such as high dimensionality, small sample size, and low signal-to-noise ratio. Results. This paper proposes a method, termed RS_SVM, that predicts cancer from gene expression profiles by aggregating SVMs trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains an SVM classifier on each, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets were performed to validate the RS_SVM method. The results show that RS_SVM achieved better classification accuracy and generalization performance than a single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and state-of-the-art methods. The experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM provides a good channel for such biological data.
PMCID: PMC5143691  PMID: 27999797
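The random-subspace aggregation described in the abstract above can be sketched roughly as follows. This is a minimal illustration only, not the authors' RS_SVM: a nearest-centroid classifier stands in for the SVM so that the example needs nothing beyond the standard library, the statistical gene pre-selection step is omitted, and all function names and parameters (`n_models`, `subspace_size`, `seed`) are hypothetical.

```python
import random
from collections import Counter

def train_centroid(X, y, features):
    """Fit a nearest-centroid model on a feature subset (stand-in for an SVM)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_centroid(model, features, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def random_subspace_ensemble(X, y, n_models=15, subspace_size=2, seed=0):
    """Train one base classifier per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_models):
        features = rng.sample(range(n_features), subspace_size)
        ensemble.append((features, train_centroid(X, y, features)))
    return ensemble

def predict_ensemble(ensemble, row):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(predict_centroid(model, features, row)
                    for features, model in ensemble)
    return votes.most_common(1)[0][0]
```

An odd `n_models` avoids tied votes in the two-class case. Replacing `train_centroid` and `predict_centroid` with an actual SVM would bring the sketch closer to the method the abstract describes.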
2.  FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm 
PLoS ONE  2016;11(3):e0150669.
The two-locus model is a typical, significant disease model to be identified in genome-wide association studies (GWAS). Owing to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models.
In this study, two scoring functions (the Bayesian network based K2-score and the Gini-score) are used to characterize two SNP loci as a candidate model. The two criteria are adopted simultaneously to improve identification power and to tackle the preference for particular disease models. The harmony search algorithm (HSA) is improved to quickly find the most likely candidate models among all two-locus models; a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating disease models that have a strong marginal effect. Finally, the G-test statistic is used to further test the candidate models.
We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical, recently developed methods (MACOED and CSE) that are based on swarm intelligence search algorithms. The results of the simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV), and accuracy (ACC). Our method identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset.
PMCID: PMC4807955  PMID: 27014873
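The final G-test step mentioned in the abstract above can be illustrated with the textbook statistic G = 2 Σ O ln(O/E) on a contingency table of genotype combinations (rows) by case/control status (columns). This is a generic sketch under that assumption, not the FHSA-SED code; the function name and table layout are hypothetical.

```python
import math

def g_test(table):
    """Return (G, degrees of freedom) for a contingency table of counts.

    G = 2 * sum(O * ln(O / E)) over non-empty cells, where E is the count
    expected under independence of rows and columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    # Assumes every row and column has at least one observation.
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g, dof
```

For a two-locus model with three genotypes per SNP, the table would have nine rows and two columns, giving eight degrees of freedom when every genotype combination is observed. G is then compared against a chi-square distribution with that many degrees of freedom.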
3.  Simulating Linkage Disequilibrium Structures in a Human Population for SNP Association Studies 
Biochemical Genetics  2011;49(0):395-409.
Existing simulation methods usually simulate linkage disequilibrium (LD) structures starting with an initial population that is randomly generated according to specified allele frequencies. These random-initialization methods might be unstable because the LD level of the initial population is generally extremely low. This study presents a new algorithm, SIMLD, to simulate genome populations with real LD structures. SIMLD begins from an initial population with possibly the highest LD level, and then the LD decays to fit the desired level through processes of mating and recombination over generations. SIMLD can produce case–control samples according to various disease models. Using empirical SNP marker information from three populations of HapMap data, we implement the proposed algorithm and demonstrate a set of experimental results.
PMCID: PMC4116680  PMID: 21234669
Case–control; Disease models; Linkage disequilibrium; Simulation; SNPs
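The idea of starting from high LD and letting it decay over generations can be illustrated with the standard population-genetics expectation that, under random mating, the LD coefficient D between two loci shrinks by a factor (1 - r) per generation, where r is the recombination fraction. The sketch below only evaluates that expectation; it is not the SIMLD algorithm, which simulates mating and recombination explicitly, and the function names are hypothetical.

```python
def ld_coefficient(p_ab, p_a, p_b):
    """D = P(AB) - P(A) * P(B) for alleles A and B at two loci."""
    return p_ab - p_a * p_b

def r_squared(p_ab, p_a, p_b):
    """Squared correlation between the two loci, a common LD measure."""
    d = ld_coefficient(p_ab, p_a, p_b)
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def expected_d(d0, r, generations):
    """Expected D after some generations of random mating: D0 * (1 - r)^t."""
    return d0 * (1 - r) ** generations

def generations_to_reach(d0, d_target, r):
    """Smallest number of generations after which expected D is <= d_target."""
    if not 0 < r <= 0.5 or d_target <= 0:
        raise ValueError("need 0 < r <= 0.5 and d_target > 0")
    d, t = d0, 0
    while d > d_target:
        d *= 1 - r
        t += 1
    return t
```

With allele frequencies of 0.5 at both loci and complete association (`p_ab = 0.5`), D starts at its maximum of 0.25 and r² at 1; a simulator in the spirit of the abstract would stop once the population's LD has decayed to the desired level.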
4.  Identification of Molecular Pathway Aberrations in Uterine Serous Carcinoma by Genome-wide Analyses 
Uterine cancer is the fourth most common malignancy in women, and uterine serous carcinoma is the most aggressive subtype. However, the molecular pathogenesis of uterine serous carcinoma is largely unknown. We analyzed the genomes of uterine serous carcinoma samples to better understand the molecular genetic characteristics of this cancer.
Whole-exome sequencing was performed on 10 uterine serous carcinomas and the matched normal blood or tissue samples. Somatically acquired sequence mutations were further verified by Sanger sequencing. The most frequent molecular genetic changes were further validated by Sanger sequencing in 66 additional uterine serous carcinomas and in nine serous endometrial intraepithelial carcinomas (the preinvasive precursor of uterine serous carcinoma) that were isolated by laser capture microdissection. In addition, gene copy number was characterized by single-nucleotide polymorphism (SNP) arrays in 23 uterine serous carcinomas, including 10 that were subjected to whole-exome sequencing.
We found frequent somatic mutations in TP53 (81.6%), PIK3CA (23.7%), FBXW7 (19.7%), and PPP2R1A (18.4%) among the 76 uterine serous carcinomas examined. All nine serous carcinomas that had an associated serous endometrial intraepithelial carcinoma had concordant PIK3CA, PPP2R1A, and TP53 mutation status between uterine serous carcinoma and the concurrent serous endometrial intraepithelial carcinoma component. DNA copy number analysis revealed frequent genomic amplification of the CCNE1 locus (which encodes cyclin E, a known substrate of FBXW7) and deletion of the FBXW7 locus. Among 23 uterine serous carcinomas that were subjected to SNP array analysis, seven tumors with FBXW7 mutations (four tumors with point mutations, three tumors with hemizygous deletions) did not have CCNE1 amplification, and 13 (57%) tumors had either a molecular genetic alteration in FBXW7 or CCNE1 amplification. Nearly half of these uterine serous carcinomas (48%) harbored PIK3CA mutation and/or PIK3CA amplification.
Molecular genetic aberrations involving the p53, cyclin E–FBXW7, and PI3K pathways represent major mechanisms in the development of uterine serous carcinoma.
PMCID: PMC3692380  PMID: 22923510
5.  An Overview of Population Genetic Data Simulation 
Simulation studies in population genetics play an important role in helping to better understand the impact of various evolutionary and demographic scenarios on sequence variation and sequence patterns, and they also permit investigators to better assess and design analytical methods in the study of disease-associated genetic factors. To facilitate these studies, it is imperative to develop simulators with the capability to accurately generate complex genomic data under various genetic models. Currently, a number of efficient simulation software packages for large-scale genomic data are available, and new simulation programs with more sophisticated capabilities and features continue to emerge. In this article, we review the three basic simulation frameworks—coalescent, forward, and resampling—and some of the existing simulators that fall under these frameworks, comparing them with respect to their evolutionary and demographic scenarios, their computational complexity, and their specific applications. Additionally, we address some limitations in current simulation algorithms and discuss future challenges in the development of more powerful simulation tools.
PMCID: PMC3244809  PMID: 22149682
backward simulators; disease association study; forward simulators; genome simulation; resampling
6.  Comparative Analysis of Methods for Identifying Recurrent Copy Number Alterations in Cancer 
PLoS ONE  2012;7(12):e52516.
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice, since very few efforts have focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genomes, it is imperative to conduct a comprehensive comparison of the performance and limitations of existing methods. In this paper, six representative methods proposed in the last six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data, respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation, and cyclic permutation schemes. We explore multiple criteria, including type I error rate, detection power, the Receiver Operating Characteristics (ROC) curve and the area under the curve (AUC), and computational complexity, to evaluate the performance of the methods under multiple simulation scenarios. We also characterize their abilities on two real datasets obtained from lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs and provides new insights into their strengths and weaknesses. It should help accelerate the development of novel and improved methods.
PMCID: PMC3527554  PMID: 23285074
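Of the evaluation criteria listed in the abstract above, the AUC is the easiest to make concrete. A minimal sketch, assuming each marker receives a score from a method and a ground-truth label (1 for a true recurrent CNA, 0 otherwise): the AUC equals the fraction of (positive, negative) pairs in which the positive scores higher, with ties counted as one half. The function name is hypothetical and this is not the evaluation code used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for true positives in the ground truth, 0 for negatives.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of markers, which is fine for illustration; a rank-based computation gives the same value more efficiently on genome-scale data.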
7.  Genome-wide identification of significant aberrations in cancer genome 
BMC Genomics  2012;13:342.
Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme.
We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the Receiver Operating Characteristics curve and increased detection power. We then apply SAIC to analyze structural genomic aberrations acquired in four real cancer genome-wide copy number data sets (ovarian cancer, metastatic prostate cancer, lung adenocarcinoma, glioblastoma). When compared with previously reported results, SAIC successfully identifies most SCAs known to be of biological significance and associated with oncogenes (e.g., KRAS, CCNE1, and MYC) or tumor suppressor genes (e.g., CDKN2A/B). Furthermore, SAIC identifies a number of novel SCAs in these copy number data that encompass tumor related genes and may warrant further studies.
Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information to study the landscape of cancer genomes. Open-source and platform-independent SAIC software is implemented using C++, together with R scripts for data formatting and Perl scripts for user interfacing, and it is easy to install and efficient to use. The source code and documentation are freely available at
PMCID: PMC3428679  PMID: 22839576
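The general idea of assessing recurrent aberrations against a permutation null that preserves correlation among consecutive probes can be sketched as follows. This is a generic cyclic-shift permutation on binary aberration calls, chosen for brevity; it is not SAIC's CNA-unit scoring or its SCA-exclusive permutation scheme, and every name and parameter here is hypothetical.

```python
import random

def recurrence(samples):
    """Number of samples showing an aberration at each probe."""
    return [sum(column) for column in zip(*samples)]

def cyclic_shift(vec, k):
    """Rotate a probe vector by k positions, keeping runs of probes intact."""
    k %= len(vec)
    return vec[k:] + vec[:k]

def permutation_pvalues(samples, n_perm=200, seed=0):
    """Per-probe p-values against the null distribution of the maximum recurrence.

    Each permutation shifts every sample by an independent random offset, so
    within-sample correlation survives while cross-sample alignment is broken.
    """
    rng = random.Random(seed)
    observed = recurrence(samples)
    n_probes = len(observed)
    null_max = []
    for _ in range(n_perm):
        shifted = [cyclic_shift(s, rng.randrange(n_probes)) for s in samples]
        null_max.append(max(recurrence(shifted)))
    # Add-one correction keeps every p-value strictly positive.
    return [(1 + sum(m >= obs for m in null_max)) / (1 + n_perm)
            for obs in observed]
```

Using the maximum over probes as the null statistic controls the family-wise error across the genome. An iterative scheme such as the one the abstract describes would additionally exclude already detected regions before re-estimating the null.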
8.  TAGCNA: A Method to Identify Significant Consensus Events of Copy Number Alterations in Cancer 
PLoS ONE  2012;7(7):e41082.
Somatic copy number alteration (CNA) is a common phenomenon in cancer genomes. Distinguishing significant consensus events (SCEs) from random background CNAs in a set of subjects has proven to be a valuable tool for studying cancer. In order to identify SCEs with an acceptable type I error rate, better computational approaches should be developed based on reasonable statistics and null distributions. In this article, we propose a new approach named TAGCNA for identifying SCEs in somatic CNAs that may encompass cancer driver genes. TAGCNA employs a peel-off permutation scheme to generate a reasonable null distribution based on a prior step of selecting tag CNA markers from the genome being considered. We demonstrate the statistical power of TAGCNA on simulated ground truth data, and validate its applicability using two publicly available cancer datasets: lung and prostate adenocarcinoma. TAGCNA identifies SCEs that are known to be involved with proto-oncogenes (e.g. EGFR, CDK4) and tumor suppressor genes (e.g. CDKN2A, CDKN2B), and provides many additional SCEs with potential biological relevance in these data. TAGCNA can be used to analyze the significance of CNAs in various cancers. It is implemented in R and is freely available at
PMCID: PMC3399811  PMID: 22815924
9.  Comparative analysis of methods for detecting interacting loci 
BMC Genomics  2011;12:344.
Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.
We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs.
This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at:
PMCID: PMC3161015  PMID: 21729295
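To make one of the compared interaction measures concrete, the sketch below computes an information gain for a SNP pair: the mutual information between the joint genotype (A, B) and disease status C, minus the two single-SNP mutual informations. This particular definition is an assumption made for illustration, and the IG method evaluated in the study may differ in detail; the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired observations."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    total = 0.0
    for (x, y), count in p_xy.items():
        # p(x,y) / (p(x) * p(y)) written with raw counts
        total += (count / n) * math.log2(count * n / (p_x[x] * p_y[y]))
    return total

def interaction_gain(snp_a, snp_b, status):
    """IG = I(A,B;C) - I(A;C) - I(B;C); positive values suggest interaction."""
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, status)
            - mutual_information(snp_a, status)
            - mutual_information(snp_b, status))
```

A pure interaction with no marginal effects, such as disease status equal to the exclusive-or of two binary genotypes, yields a gain of one bit, while a main-effect test such as single-SNP logistic regression would see nothing in the same data.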
10.  Probability Theory-based SNP Association Study Method for Identifying Susceptibility Loci and Genetic Disease Models in Human Case-Control Data 
One of the most challenging tasks in studying common complex human diseases is to search for both strong and weak susceptibility single-nucleotide polymorphisms (SNPs) and to identify forms of genetic disease models. A number of methods have been proposed for this purpose. Many of them have not been validated through application to various genome datasets, so their abilities in real practice are not clear. In this paper, we present a novel SNP association study method based on probability theory, called ProbSNP. The method first detects SNPs by evaluating their joint probabilities in combination with disease status and selects those with the lowest joint probabilities as susceptibility SNPs, and then identifies some forms of genetic disease models by testing multiple-locus interactions among the selected SNPs. The joint probabilities of combined SNPs are estimated by establishing Gaussian distribution probability density functions, in which the related parameters (i.e., mean value and standard deviation) are evaluated based on allele and haplotype frequencies. Finally, we test and validate the method using various genome datasets. We find that ProbSNP has shown remarkable success in applications to both simulated genome data and real genome-wide data.
PMCID: PMC3029504  PMID: 20840904
Association study; SNPs; probability theory; Gaussian distribution; case-control
