Results 1-13 (13)

1.  Intra-Tumoral Heterogeneity of HER2, FGFR2, cMET and ATM in Gastric Cancer: Optimizing Personalized Healthcare through Innovative Pathological and Statistical Analysis 
PLoS ONE  2015;10(11):e0143207.
Current drug development efforts on gastric cancer are directed against several molecular targets driving the growth of this neoplasm. However, intra-tumoral biomarker heterogeneity, commonly observed in gastric cancer, could lead to biased selection of patients. MET, ATM, FGFR2, and HER2 were profiled on gastric cancer biopsy samples. An innovative pathological assessment was performed through scoring of individual biopsies against whole biopsies from a single patient to enable heterogeneity evaluation. Following this, false negative risks for each biomarker were estimated in silico. 166 gastric cancer cases with multiple biopsies from single patients were collected from Shanghai Renji Hospital. Following pre-set criteria, 56~78% of cases showed low, 15~35% showed medium, and 0~11% showed high heterogeneity within the biomarkers profiled. If 3 biopsies were collected from a single patient, the false negative risk for detection of the biomarkers was close to 5% (the exception being FGFR2, at 12.2%). When 6 biopsies were collected, the false negative risk approached 0%. Our study demonstrates the benefit of multiple biopsy sampling when considering a personalized healthcare biomarker strategy, and provides an example of how to address the challenge of intra-tumoral biomarker heterogeneity using alternative pathological assessment and statistical methods.
PMCID: PMC4654477  PMID: 26587992
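The abstract above does not describe how the in silico false negative risk was estimated. As a rough illustration of why the risk falls as more biopsies are taken, the following sketch assumes that each biopsy from a biomarker-positive patient tests positive independently with the same probability. The independence assumption, the function name `false_negative_risk`, and the detection probability of 0.6 are all hypothetical and are not taken from the study.

```python
def false_negative_risk(p_detect: float, n_biopsies: int) -> float:
    """Chance that every biopsy from a biomarker-positive patient tests negative.

    Assumes (hypothetically) that biopsies are independent and that each one
    tests positive with the same probability p_detect.
    """
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must lie in [0, 1]")
    if n_biopsies < 1:
        raise ValueError("n_biopsies must be at least 1")
    # All n biopsies must miss the biomarker for the patient to be a false negative.
    return (1.0 - p_detect) ** n_biopsies


if __name__ == "__main__":
    # 0.6 is an illustrative per-biopsy detection probability, not a study value.
    for n in (1, 3, 6):
        print(n, false_negative_risk(0.6, n))
```

Under this simplified model the risk shrinks geometrically with the number of biopsies, which is consistent in direction with the reported drop between 3 and 6 biopsies.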
2.  Data-Driven Information Extraction from Chinese Electronic Medical Records 
PLoS ONE  2015;10(8):e0136270.
This study aims to propose a data-driven framework that takes unstructured free text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event.
Materials and Methods
Our framework uses a hybrid approach. It consists of constructing cross-domain core medical lexica, an unsupervised, iterative algorithm to accrue more accurate terms into the lexica, rules to address Chinese writing conventions and temporal descriptors, and a Support Vector Machine (SVM) algorithm that innovatively utilizes Normalized Google Distance (NGD) to estimate the correlation between medical events and their descriptions.
The effectiveness of the framework was demonstrated with a dataset of 24,817 de-identified Chinese EMRs. The cross-domain medical lexica were capable of recognizing terms with an F1-score of 0.896. 98.5% of recorded medical events were linked to temporal descriptors. The NGD SVM description-event matching achieved an F1-score of 0.874. The end-to-end time-event-description extraction of our framework achieved an F1-score of 0.846.
In terms of named entity recognition, the proposed framework outperforms state-of-the-art supervised learning algorithms (F1-score: 0.896 vs. 0.886). In event-description association, the NGD SVM is superior to SVM using only local context and semantic features (F1-score: 0.874 vs. 0.838).
The framework is data-driven, weakly supervised, and robust against the variations and noises that tend to occur in a large corpus. It addresses Chinese medical writing conventions and variations in writing styles through patterns used for discovering new terms and rules for updating the lexica.
PMCID: PMC4546596  PMID: 26295801
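For readers unfamiliar with the Normalized Google Distance mentioned above, the sketch below computes it from co-occurrence counts using the standard definition, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). How the authors obtained their counts and fed the distance into the SVM is not stated in the abstract, so the function name and the example counts are assumptions for illustration only.

```python
import math


def normalized_google_distance(fx: int, fy: int, fxy: int, n_total: int) -> float:
    """Normalized Google Distance between two terms.

    fx, fy  : number of documents containing term x, term y
    fxy     : number of documents containing both terms
    n_total : total number of documents indexed
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n_total <= max(fx, fy):
        raise ValueError("n_total must exceed both term counts")
    if fxy <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return math.inf
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n_total) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts: an event term and a description term in a corpus.
    print(normalized_google_distance(fx=100, fy=100, fxy=10, n_total=10_000))
```

A smaller distance indicates that two terms tend to appear together, which is the kind of signal that could help link a medical event to its description.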
3.  Patient-Derived Gastric Carcinoma Xenograft Mouse Models Faithfully Represent Human Tumor Molecular Diversity 
PLoS ONE  2015;10(7):e0134493.
Patient-derived cancer xenografts (PDCX) generally represent more reliable models of human disease in which to evaluate a potential drug's preclinical efficacy. However, to date, only a few patient-derived gastric cancer xenograft (PDGCX) models have been reported. In this study, we aimed to establish additional PDGCX models and to evaluate whether these models accurately reflected the histological and genetic diversities of the corresponding patient tumors. By engrafting fresh patient gastric cancer (GC) tissues into immune-compromised mice (SCID and/or nude mice), thirty-two PDGCX models were established. Histological features were assessed by a qualified pathologist based on H&E staining. Genomic comparison was performed for several biomarkers including ERBB1, ERBB2, ERBB3, FGFR2, MET and PTEN. These biomarkers were profiled to assess gene copy number by fluorescent in situ hybridization (FISH) and/or protein expression by immunohistochemistry (IHC). All 32 PDGCX models retained the histological features of the corresponding human tumors. Furthermore, among the 32 models, 78% (25/32) highly expressed ERBB1 (EGFR), 22% (7/32) were ERBB2 (HER2) positive, 78% (25/32) showed ERBB3 (HER3) high expression, 66% (21/32) lost PTEN expression, 3% (1/32) harbored FGFR2 amplification, 41% (13/32) were positive for MET expression and 16% (5/32) were MET gene amplified. Between the PDGCX models and their parental tumors, a high degree of similarity was observed for FGFR2 and MET gene amplification, and also for ERBB2 status (agreement rate = 94~100%; kappa value = 0.81~1). Protein expression of PTEN and MET also showed moderate agreement (agreement rate = 78%; kappa value = 0.46~0.56), while ERBB1 and ERBB3 expression showed slight agreement (agreement rate = 59~75%; kappa value = 0.18~0.19). ERBB2 positivity and FGFR2 or MET gene amplification were all maintained until passage 12 in mice.
The stability of the molecular profiles observed across subsequent passages within the individual models provides confidence in the utility and translational significance of these models for in vivo testing of personalized therapies.
PMCID: PMC4517891  PMID: 26217940
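The agreement rates and kappa values quoted above can be understood through Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below computes both from a 2x2 table of paired positive/negative calls. The function name and the counts in the example are hypothetical; the abstract does not give the underlying tables.

```python
def cohen_kappa(both_pos: int, a_only: int, b_only: int, both_neg: int) -> float:
    """Cohen's kappa for two binary ratings of the same samples.

    both_pos : samples positive in both A (e.g. xenograft) and B (e.g. parental tumor)
    a_only   : positive in A only
    b_only   : positive in B only
    both_neg : negative in both
    """
    n = both_pos + a_only + b_only + both_neg
    if n == 0:
        raise ValueError("the table is empty")
    observed = (both_pos + both_neg) / n            # raw agreement rate
    a_pos = both_pos + a_only                       # positives according to A
    b_pos = both_pos + b_only                       # positives according to B
    expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both ratings are constant and identical.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    print(cohen_kappa(both_pos=20, a_only=5, b_only=10, both_neg=15))
```

Because kappa discounts chance agreement, a marker that is positive in most samples can show a fairly high agreement rate but a low kappa, as with the ERBB1 and ERBB3 figures reported above.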
4.  Recruitment Challenges of a Multicenter Randomized Controlled Varicocelectomy Trial 
Fertility and sterility  2011;96(6):1299-1305.
Study Objective
To review reasons for suboptimal recruitment for a randomized controlled trial (RCT) of varicocelectomy vs. intrauterine insemination for treatment of male infertility, and suggest means to improve future study recruitment.
A survey of RMN participating sites.
The Reproductive Medicine Network.
Main Outcome Measures
Ascertain reasons for inadequate recruitment and suggest improvements for future varicocelectomy trials.
This study screened 7 and enrolled 3 couples with the first couple randomized on 6/30/2010. The study was subsequently stopped on 03/30/2011. The following themes were cited most frequently by sites and therefore determined to be most likely to have played a role in suboptimal recruitment: (1) men must be screened at the beginning of a couple's infertility evaluation, (2) inclusion of infertile women who have failed previous fertility interventions appeared to be associated with couple intolerance of a placebo arm, and (3) there appeared to be bias against the use of unstimulated IUI cycles, indicating a prejudicial preference for surgical intervention in the male partner.
Improved recruitment may be realized through screening infertile men as early as possible while minimizing study-related time commitments. Focused patient education may promote improved ‘equipoise’ and acceptance of a placebo arm in male infertility studies. Lastly, creative approaches to implementing varicocelectomy trials must be considered in addition to having a network of motivated researchers who carry a high volume of possible study participants, as screening of very large numbers may be needed to complete clinical trial enrollment. NCT00767338.
PMCID: PMC3243664  PMID: 22130101
Recruitment; consent; randomization; accrual; enrollment; prospective; varicocele; varicocelectomy
5.  Detecting Genes and Gene-gene Interactions for Age-related Macular Degeneration with a Forest-based Approach 
Age-related macular degeneration (AMD) is a leading cause of vision loss in the elderly. Genetic mechanisms underlying AMD are complex. Understanding the etiology of AMD is important because of the significant health and social concerns. In this paper, we describe a forest-based approach to systematically identifying multiple genes, gene-gene interactions and gene-environment interactions underlying complex diseases in genomewide case-control studies and the application of this approach to a published data set on AMD. Our analysis not only confirmed two known haplotypes, ACTCCG (on chromosome 1 with a p-value of 1.98e-6) and TCTGGACGACA (on chromosome 7 with a p-value of 9.81e-3), but also revealed two novel haplotypes, GATAGT (on chromosome 5 with a p-value of 3.46e-3) and TCTTACGTAGA (on chromosome 12 with a p-value of 3.16e-2). Thus, the significance of this work is twofold. First, we propose a powerful and robust method to identify high-risk haplotypes and their interactions; second, we reveal potential genetic variants associated with AMD.
PMCID: PMC2799940  PMID: 20161521
Age-related macular degeneration; Genomewide association; Haplotype; Interaction; Random Forest
6.  Using Mutual Information to Discover Temporal Patterns in Gene Expression Data 
Finding relations among gene expressions involves the definition of the similarity between experimental data. The simplest similarity measure is the Correlation Coefficient. It is able to identify linear dependences only; moreover, it is sensitive to experimental errors. An alternative measure, the Shannon Mutual Information (MI), is free from the above-mentioned weaknesses. However, the calculation of MI for continuous variables from the finite number of experimental points, N, involves an ambiguity arising when one divides the range of values of the continuous variable into boxes. Then the distribution of experimental points among the boxes (and, therefore, MI) depends on the box size. An algorithm for the calculation of MI for continuous variables is proposed. We find the optimum box sizes for a given N from the condition of minimum entropy variation with respect to the change of the box sizes. We have applied this technique to the gene expression dataset from Stanford, containing microarray data at 18 time points from yeast Saccharomyces cerevisiae cultures (Spellman et al.,[3]). We calculated MI for all of the pairs of time points. The MI analysis allowed us to identify time patterns related to different biological processes in the cell.
PMCID: PMC2860312  PMID: 20428481
Gene expression; Mutual information
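The dependence of MI on box size described above can be seen in a simple plug-in estimator that divides each variable's range into equal-width boxes and computes MI from the resulting counts. This is only a minimal sketch: it does not implement the authors' criterion for choosing the optimum box size, and the function names and toy data are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def box_index(value: float, lo: float, hi: float, n_boxes: int) -> int:
    """Index of the equal-width box that contains value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_boxes)
    return min(idx, n_boxes - 1)  # put the maximum value in the last box


def mutual_information(xs: Sequence[float], ys: Sequence[float], n_boxes: int) -> float:
    """Plug-in estimate of Shannon MI (in nats) from binned paired samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    bx = [box_index(x, min(xs), max(xs), n_boxes) for x in xs]
    by = [box_index(y, min(ys), max(ys), n_boxes) for y in ys]
    count_x, count_y, count_xy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (i, j), c in count_xy.items():
        # p(i,j) * log( p(i,j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[i] * count_y[j]))
    return mi


if __name__ == "__main__":
    xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.8, 2.2, 2.9]   # toy data, not from the paper
    ys = [x * x for x in xs]
    for n_boxes in (2, 3, 4):
        print(n_boxes, mutual_information(xs, ys, n_boxes))
```

Running the example with different values of `n_boxes` gives different MI estimates for the same data, which is the ambiguity the proposed algorithm sets out to resolve.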
7.  Detecting significant single-nucleotide polymorphisms in a rheumatoid arthritis study using random forests 
BMC Proceedings  2009;3(Suppl 7):S69.
Random forest is an efficient approach for investigating not only the effects of individual markers on a trait but also the effect of the interactions among the markers in genetic association studies. This approach is especially appealing for the analysis of genome-wide data, such as those obtained from gene expression/single-nucleotide polymorphism (SNP) array experiments in which the number of candidate genes/SNPs is vast. We applied this approach to the Genetic Analysis Workshop 16 Problem 1 data to identify SNPs that contribute to rheumatoid arthritis. The random forest computed a raw importance score for each SNP marker, where a higher importance score suggests a higher level of association between the marker and the trait. The significance level of the association was determined empirically by repeatedly reapplying the random forest on randomly generated data under the null hypothesis that no association exists between the markers and the trait. Using random forest, we were able to identify 228 significant SNPs (at the genome-wide significance level of 0.05) across the whole genome, over two-thirds of which are located on chromosome 6, especially clustered in the region of 6p21 containing the human leukocyte antigen (HLA) genes, such as HLA-DRB1 and HLA-DRA. Further analysis of this region indicates a strong association with rheumatoid arthritis status.
PMCID: PMC2795970  PMID: 20018063
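The empirical significance procedure described above compares an observed score against scores obtained when no association exists. The sketch below illustrates the idea for a single marker. Two simplifications are assumptions of this example and not the authors' method: null data are produced by shuffling the case/control labels, and a difference in mean genotype (`mean_difference`) stands in for the random forest importance score.

```python
import random
from typing import Callable, Sequence


def mean_difference(genotypes: Sequence[float], labels: Sequence[int]) -> float:
    """Stand-in association score: |mean genotype in cases - mean in controls|."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(
    genotypes: Sequence[float],
    labels: Sequence[int],
    score_fn: Callable[[Sequence[float], Sequence[int]], float] = mean_difference,
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """Empirical p-value: share of null scores at least as large as the observed one."""
    rng = random.Random(seed)
    observed = score_fn(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait link, keeps group sizes
        if score_fn(genotypes, shuffled) >= observed:
            exceed += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data: genotype coded 0/1/2, label 1 = case, 0 = control.
    genotypes = [2, 2, 1, 2, 2, 1, 0, 0, 1, 0, 0, 0]
    labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    print(permutation_p_value(genotypes, labels))
```

In a genome-wide setting the same comparison would be made for every marker, with the threshold chosen to control the genome-wide error rate.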
8.  Memory management in genome-wide association studies 
BMC Proceedings  2009;3(Suppl 7):S54.
Genome-wide association is a powerful tool for the identification of genes that underlie common diseases. Genome-wide association studies generate billions of genotypes and pose significant computational challenges for most users including limited computer memory. We applied a recently developed memory management tool to two analyses of North American Rheumatoid Arthritis Consortium studies and measured the performance in terms of central processing unit and memory usage. We conclude that our memory management approach is simple, efficient, and effective for genome-wide association studies.
PMCID: PMC2795954  PMID: 20018047
9.  A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines 
BMC Proceedings  2009;3(Suppl 7):S119.
The Framingham Heart Study is a well-known longitudinal cohort study. In recent years, the community-based Framingham Heart Study has embarked on genome-wide association studies. In this paper, we present a Framingham Heart Study genome-wide analysis of the fasting triglycerides trait in the Genetic Analysis Workshop 16 Problem 2 using multivariate adaptive splines for the analysis of longitudinal data (MASAL). With MASAL, we are able to perform analysis of genome-wide data with longitudinal phenotypes and covariates, making it possible to identify genes, gene-gene, and gene-environment (including time) interactions associated with the trait of interest. We conducted a permutation test to assess the associations between MASAL-selected markers and the triglycerides trait and report significant gene-gene and gene-environment interaction effects on the trait of interest.
PMCID: PMC2795891  PMID: 20017984
10.  LOT: a tool for linkage analysis of ordinal traits for pedigree data 
Bioinformatics  2008;24(15):1737-1739.
Summary: Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
Availability: The LOT program is available for download at
PMCID: PMC2566542  PMID: 18535081
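To make the proportional-odds idea above concrete, the sketch below computes the probability of each ordinal category under a generic proportional-odds logistic model, in which P(Y <= k | x) = 1 / (1 + exp(-(a_k - b*x))) with increasing thresholds a_k and a single slope b shared by all categories. This is not the LOT implementation: the latent variable tied to inheritance patterns is not modelled here, and the function name, thresholds, and slope are hypothetical.

```python
import math
from typing import List, Sequence


def category_probabilities(thresholds: Sequence[float], beta: float, x: float) -> List[float]:
    """Probabilities of each ordinal category under a proportional-odds model.

    thresholds : increasing cut-points a_1 < ... < a_{K-1} for K categories
    beta       : slope shared by all cumulative logits (the proportional-odds assumption)
    x          : covariate value
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")
    # Cumulative probabilities P(Y <= k | x); the last category closes at 1.
    cumulative = [1.0 / (1.0 + math.exp(-(a - beta * x))) for a in thresholds]
    cumulative.append(1.0)
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)  # P(Y = k) = P(Y <= k) - P(Y <= k-1)
        previous = c
    return probabilities


if __name__ == "__main__":
    # Hypothetical four-level ordinal trait and a single covariate.
    for x in (0.0, 1.0, 2.0):
        print(x, category_probabilities([-1.0, 0.0, 1.0], beta=0.8, x=x))
```

With a positive slope, larger covariate values shift probability toward the higher categories while the spacing between thresholds stays fixed.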
11.  Analysis of Twin Data Using SAS 
Biometrics  2008;65(2):584-589.
Twin studies are essential for assessing disease inheritance. Data generated from twin studies are traditionally analyzed using specialized computational programs. For many researchers, especially those who are new to twin studies, understanding and using those specialized computational programs can be a daunting task. Given that SAS is the most popular software for statistical analysis, we suggest the use of SAS procedures for twin data may be a helpful alternative and demonstrate that we can obtain similar results from SAS to those produced by specialized computational programs. This numerical validation is practically useful, because a natural concern with general statistical software is whether it can deal with data that are generated from special study designs such as twin studies and whether it can test a particular hypothesis. We conclude through our extensive simulation that SAS procedures can be used easily as a very convenient alternative to specialized programs for twin data analysis.
PMCID: PMC2700843  PMID: 18647295
Twin study; Variance components method; Heritability; Generalized linear mixed model; SAS PROC MIXED; SAS PROC NLMIXED
12.  Ultraspecific probes for high throughput HLA typing 
BMC Genomics  2009;10:85.
The variations within an individual's HLA (Human Leukocyte Antigen) genes have been linked to many immunological events, e.g. susceptibility to disease, response to vaccines, and the success of blood, tissue, and organ transplants. Although the microarray format has the potential to achieve high-resolution typing, this has yet to be attained due to inefficiencies of current probe design strategies.
We present a novel three-step approach for the design of high-throughput microarray assays for HLA typing. This approach first selects sequences containing the SNPs present in all alleles of the locus of interest. Next, it calculates the number of base changes necessary to convert a candidate probe sequence to the closest subsequence within the set of sequences that are likely to be present in the sample, including the remainder of the human genome, in order to identify those candidate probes that are "ultraspecific" for the allele of interest. Due to the high specificity of these sequences, it is possible that preliminary steps such as PCR amplification are no longer necessary. Lastly, the minimum number of these ultraspecific probes is selected such that the highest resolution typing can be achieved for the minimal cost of production. As an example, an array was designed and in silico results were obtained for typing of the HLA-B locus.
The assay presented here provides a higher resolution than has previously been developed and includes more alleles than previously considered. Based upon the in silico and preliminary experimental results, we believe that the proposed approach can be readily applied to any highly polymorphic gene system.
PMCID: PMC2661095  PMID: 19232123
13.  LOT 
Bioinformatics (Oxford, England)  2008;24(15):1737-1739.
Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
PMCID: PMC2566542  PMID: 18535081
