1.  How Good Is Crude MDL for Solving the Bias-Variance Dilemma? An Empirical Investigation Based on Bayesian Networks 
PLoS ONE  2014;9(3):e92866.
The bias-variance dilemma is a well-known and important problem in Machine Learning. It relates the generalization capability (goodness of fit) of a learning method to its complexity. When we have enough data at hand, it is possible to use these data so as to minimize overfitting (the risk of selecting a complex model that generalizes poorly). Unfortunately, there are many situations where we simply do not have the required amount of data. Thus, we need to find methods capable of efficiently exploiting the available data while avoiding overfitting. Different metrics have been proposed to achieve this goal: the Minimum Description Length principle (MDL), Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), among others. In this paper, we focus on crude MDL and empirically evaluate its performance in selecting models with a good balance between goodness of fit and complexity: the so-called bias-variance dilemma, decomposition or tradeoff. Although the graphical interaction between these dimensions (bias and variance) is ubiquitous in the Machine Learning literature, few works present experimental evidence that recovers such interaction. We argue that the graphs resulting from our experiments allow us to gain insights that are difficult to unveil otherwise: that crude MDL naturally selects balanced models in terms of bias-variance, which need not be the gold-standard ones. We carry out these experiments using a specific model: a Bayesian network. In spite of these motivating results, we should not overlook three other components that may significantly affect the final model selection: the search procedure, the noise rate and the sample size.
PMCID: PMC3966834  PMID: 24671204
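As a rough illustration of the tradeoff described in the abstract above, the following sketch scores candidate models with the commonly used two-part form of crude MDL, -log L + (k/2) log n, where L is the maximized likelihood, k the number of free parameters and n the sample size. The candidate names, log-likelihoods and parameter counts are hypothetical values chosen for illustration; they are not taken from the paper, and the paper's exact scoring and search procedure may differ.

```python
import math


def crude_mdl(log_likelihood: float, num_params: int, sample_size: int) -> float:
    """Two-part crude MDL score in nats; lower is better."""
    fit_term = -log_likelihood                                  # goodness of fit
    complexity_term = 0.5 * num_params * math.log(sample_size)  # model complexity
    return fit_term + complexity_term


# Hypothetical candidates: (name, maximized log-likelihood, free parameters).
candidates = [
    ("sparse", -5200.0, 10),
    ("medium", -5000.0, 25),
    ("dense", -4990.0, 120),
]
n = 1000  # hypothetical sample size

scores = {name: crude_mdl(ll, k, n) for name, ll, k in candidates}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

With these made-up numbers the densest model fits best but pays a large complexity penalty, so the intermediate model obtains the lowest score.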
2.  Inferring the Temporal Order of Cancer Gene Mutations in Individual Tumor Samples 
PLoS ONE  2014;9(2):e89244.
The temporal order of cancer gene mutations in tumors is essential for understanding and treating the disease. Existing methods are unable to infer the order of mutations that are identified at the same time in individual tumor samples, leaving the heterogeneity of the order unknown. Here, we show that the temporal order in individual samples can be effectively inferred through a complex network-based approach built on a newly defined statistic, carcinogenesis information conductivity (CIC). The results suggest that tumor-suppressor genes might more frequently initiate the order of mutations than oncogenes, and every type of cancer might have its own unique order of mutations. The initial mutations appear to be dedicated to acquiring the function of evading apoptosis, and some order constraints might reflect potential regularities. Our approach is completely data-driven without any parameter settings and can be expected to become more effective as more data become available.
PMCID: PMC3937336  PMID: 24586626
3.  A Novel Toxicokinetic Modeling of Cypermethrin and Permethrin and Their Metabolites in Humans for Dose Reconstruction from Biomarker Data 
PLoS ONE  2014;9(2):e88517.
To assess exposure to pyrethroids in the general population, one of the most widely used methods today consists of measuring urinary metabolites. Unfortunately, interpretation of data is limited by the unspecified relation between dose and levels in biological tissues and excreta. The objective of this study was to develop a common multi-compartment toxicokinetic model to predict the time courses of two widely used pyrethroid pesticides, permethrin and cypermethrin, and their metabolites (cis-DCCA, trans-DCCA and 3-PBA) in the human body and in accessible biological matrices following different exposure scenarios. Toxicokinetics was described mathematically by systems of differential equations to yield the time courses of these pyrethroids and their metabolites in the different compartments. Unknown transfer rate values between compartments were determined from best fits to available human data on the urinary excretion time courses of metabolites following oral and dermal exposure to cypermethrin in volunteers. Since values for these coefficients have not yet been determined, a mathematical routine was programmed in MathCad to establish the possible range of values on the basis of physiological and mathematical considerations. The best combination of parameter values was then selected using a statistical measure (reliability factor) along with a statistically acceptable range of values for each parameter. With this approach, simulations provided a close approximation to published time course data. This model allows prediction of the urinary time courses of trans-DCCA, cis-DCCA and 3-PBA, whatever the exposure route. It can also serve to reconstruct absorbed doses of permethrin or cypermethrin in the population using measured biomarker data.
PMCID: PMC3935837  PMID: 24586336
4.  ENZYMAP: Exploiting Protein Annotation for Modeling and Predicting EC Number Changes in UniProt/Swiss-Prot 
PLoS ONE  2014;9(2):e89162.
The volume and diversity of biological data are increasing at very high rates. Vast amounts of protein sequences and structures, protein and genetic interactions and phenotype studies have been produced. The majority of data generated by high-throughput devices is automatically annotated because manually annotating them is not possible. Thus, efficient and precise automatic annotation methods are required to ensure the quality and reliability of both the biological data and associated annotations. We proposed ENZYMatic Annotation Predictor (ENZYMAP), a technique to characterize and predict EC number changes based on annotations from UniProt/Swiss-Prot using a supervised learning approach. We evaluated ENZYMAP experimentally, using test data sets from both UniProt/Swiss-Prot and UniProt/TrEMBL, and showed that predicting EC changes using selected types of annotation is possible. Finally, we compared ENZYMAP and DETECT with respect to their predictions and checked both against the UniProt/Swiss-Prot annotations. ENZYMAP was shown to be more accurate than DETECT, coming closer to the actual changes in UniProt/Swiss-Prot. Our proposal is intended to be an automatic complementary method (that can be used together with other techniques like the ones based on protein sequence and structure) that helps to improve the quality and reliability of enzyme annotations over time, suggesting possible corrections, anticipating annotation changes and propagating the implicit knowledge for the whole dataset.
PMCID: PMC3929618  PMID: 24586563
5.  Redundancy-Aware Topic Modeling for Patient Record Notes 
PLoS ONE  2014;9(2):e87555.
The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation, carried out through log-likelihood on held-out data and topic coherence of produced topics, and qualitative assessment of topics, carried out by physicians, show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
PMCID: PMC3923754  PMID: 24551060
6.  Modelling Pathways to Rubisco Degradation: A Structural Equation Network Modelling Approach 
PLoS ONE  2014;9(2):e87597.
‘Omics analysis (transcriptomics, proteomics) quantifies changes in gene/protein expression, providing a snapshot of changes in biochemical pathways over time. Although tools such as modelling that are needed to investigate the relationships between genes/proteins already exist, they are rarely utilised. We consider the potential for using Structural Equation Modelling to investigate protein-protein interactions in a proposed Rubisco protein degradation pathway using previously published data from 2D electrophoresis and mass spectrometry proteome analysis. These informed the development of a prior model that hypothesised a pathway of Rubisco Large Subunit and Small Subunit degradation, producing both primary and secondary degradation products. While some of the putative pathways were confirmed by the modelling approach, the model also demonstrated features that had not been originally hypothesised. We used Bayesian analysis based on Markov Chain Monte Carlo simulation to generate output statistics suggesting that the model had replicated the variation in the observed data due to protein-protein interactions. This study represents an early step in the development of approaches that seek to enable the full utilisation of information regarding the dynamics of biochemical pathways contained within proteomics data. As these approaches gain attention, they will guide the design and conduct of experiments that enable ‘Omics modelling to become a commonplace practice within molecular biology.
PMCID: PMC3911993  PMID: 24498339
7.  Regression-Based Ranking of Pathogen Strains with Respect to Their Contribution to Natural Epidemics 
PLoS ONE  2014;9(1):e86591.
Genetic variation in pathogen populations may be an important factor driving heterogeneity in disease dynamics within their host populations. However, to date, we understand poorly how genetic diversity in diseases impacts epidemiological dynamics because the data and tools required to answer this question are lacking. Here, we combine pathogen genetic data with epidemiological monitoring of disease progression, and introduce a statistical exploratory method to investigate differences among pathogen strains in their performance in the field. The method exploits epidemiological data providing a measure of disease progress in time and space, and genetic data indicating the relative spatial patterns of the sampled pathogen strains. Applying this method makes it possible to assign ranks to the pathogen strains with respect to their contributions to natural epidemics and to assess the significance of the ranking. This method was first tested on simulated data, including data obtained from an original, stochastic, multi-strain epidemic model. It was then applied to epidemiological and genetic data collected during one natural epidemic of powdery mildew occurring in its wild host population. Based on the simulation study, we conclude that the method can achieve its aim of ranking pathogen strains if the sampling effort is sufficient. For powdery mildew data, the method indicated that one of the sampled strains tends to have a higher fitness than the four other sampled strains, highlighting the importance of strain diversity for disease dynamics. Our approach, which allows the comparison of pathogen strains in natural epidemics, is complementary to the classical practice of using experimental infections in controlled conditions to estimate the fitness of different pathogen strains. Our statistical tool, implemented in the R package StrainRanking, is mainly based on regression and does not rely on mechanistic assumptions on the pathogen dynamics. Thus, the method can be applied to a wide range of pathogens.
PMCID: PMC3909007  PMID: 24497956
8.  Unraveling the Hidden Heterogeneities of Breast Cancer Based on Functional miRNA Cluster 
PLoS ONE  2014;9(1):e87601.
It has become increasingly clear that the current taxonomy of clinical phenotypes is mixed with molecular heterogeneity, which potentially affects the treatment effect for involved patients. Defining the hidden, molecularly distinct diseases using modern large-scale genomic approaches is therefore useful for refining clinical practice and improving intervention strategies. Given that microRNA expression profiling has provided a powerful way to dissect hidden genetic heterogeneity for complex diseases, the aim of the study was to develop a bioinformatics approach that identifies microRNA features leading to the hidden subtyping of complex clinical phenotypes. The basic strategy of the proposed method was to identify optimal miRNA clusters by iteratively partitioning the sample and feature space using the two-way super-paramagnetic clustering technique. We evaluated the obtained optimal miRNA cluster by determining the consistency of co-expression and the chromosome location among the within-cluster microRNAs, and concluded that the optimal miRNA cluster could lead to a natural partition of disease samples. We applied the proposed method to a publicly available microarray dataset of breast cancer patients, who have notoriously heterogeneous phenotypes. We obtained a feature subset of 13 microRNAs that could classify the 71 breast cancer patients into five subtypes with significantly different five-year overall survival rates (45%, 82.4%, 70.6%, 100% and 60% respectively; p = 0.008). By building a multivariate Cox proportional-hazards prediction model for the feature subset, we identified hsa-miR-146b as one of the most significant predictors (p = 0.045; hazard ratio = 0.39). The proposed algorithm is a promising computational strategy for dissecting hidden genetic heterogeneity for complex diseases, and will be of value for improving cancer diagnosis and treatment.
PMCID: PMC3907466  PMID: 24498150
9.  Characteristics of Networks of Interventions: A Description of a Database of 186 Published Networks 
PLoS ONE  2014;9(1):e86754.
Systematic reviews that employ network meta-analysis are undertaken and published with increasing frequency while related statistical methodology is evolving. Future statistical developments and evaluation of the existing methodologies could be motivated by the characteristics of the networks of interventions published so far in order to tackle real rather than theoretical problems. Based on the recently formed network meta-analysis literature we aim to provide an insight into the characteristics of networks in healthcare research. We searched PubMed until the end of 2012 for meta-analyses that used any form of indirect comparison. We collected data from networks that compared at least four treatments regarding their structural characteristics as well as characteristics of their analysis. We then conducted a descriptive analysis of the various network characteristics. We included 186 networks of which 35 (19%) were star-shaped (treatments were compared to a common comparator but not between themselves). The median number of studies per network was 21 and the median number of treatments compared was 6. The majority (85%) of the non-star shaped networks included at least one multi-arm study. Synthesis of data was primarily done via network meta-analysis fitted within a Bayesian framework (113 (61%) networks). We were unable to identify the exact method used to perform indirect comparison in a sizeable number of networks (18 (9%)). In 32% of the networks the investigators employed appropriate statistical methods to evaluate the consistency assumption; this percentage is larger among recently published articles. Our descriptive analysis provides useful information about the characteristics of networks of interventions published in the last 16 years and the methods for their analysis. Although the validity of network meta-analysis results highly depends on some basic assumptions, most authors did not report and evaluate them adequately. Reviewers and editors need to be aware of these assumptions and insist on their reporting and accuracy.
PMCID: PMC3899297  PMID: 24466222
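The definition of a star-shaped network given in the abstract above (all treatments compared to a common comparator but not between themselves) can be checked mechanically. The sketch below is one possible reading of that definition; the function name, the input format (a list of pairwise comparisons) and the treatment labels are assumptions made for illustration, not the authors' procedure.

```python
def is_star_shaped(comparisons):
    """Return True if every direct comparison involves one common treatment.

    `comparisons` is an iterable of (treatment_a, treatment_b) pairs.
    Networks with fewer than three treatments are not counted as stars here.
    """
    edges = {frozenset(pair) for pair in comparisons}
    treatments = set()
    for edge in edges:
        treatments |= edge
    if len(treatments) < 3:
        return False
    # A hub exists if some treatment appears in every comparison, which also
    # means no two non-hub treatments are compared directly.
    return any(all(hub in edge for edge in edges) for hub in treatments)


# Hypothetical networks with made-up treatment labels.
star = [("placebo", "A"), ("placebo", "B"), ("placebo", "C")]
non_star = star + [("A", "B")]
print(is_star_shaped(star), is_star_shaped(non_star))
```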
10.  Selection of Reliable Biomarkers from PCR Array Analyses Using Relative Distance Computational Model: Methodology and Proof-of-Concept Study 
PLoS ONE  2013;8(12):e83954.
It is increasingly evident that monitoring chemical exposure through biomarkers is difficult, as almost none of the biomarkers proposed so far is specific for an individual chemical. In this proof-of-concept study, adult male zebrafish (Danio rerio) were exposed to 5 or 25 µg/L 17β-estradiol (E2), 100 µg/L lindane, 5 nM 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) or 15 mg/L arsenic for 96 h, and the expression profiles of 59 genes involved in 7 pathways plus 2 well characterized biomarker genes, vtg1 (vitellogenin1) and cyp1a1 (cytochrome P450 1A1), were examined. A relative distance (RD) computational model was developed to screen favorable genes and generate appropriate gene sets for the differentiation of the chemicals/concentrations selected. Our results demonstrated that the known biomarker genes were not always good candidates for the differentiation of pairs of chemicals/concentrations, and other genes had higher potential in some cases. Furthermore, the differentiation of the 5 chemicals/concentrations examined was attainable using expression data of various gene sets, and the best combination was the set consisting of 50 genes; however, as few as two genes (e.g. vtg1 and hspa5 [heat shock protein 5]) were sufficient to differentiate the five chemical/concentration groups in the present test. These observations suggest that multi-parameter arrays should be more reliable for biomonitoring of chemical exposure than traditional biomarkers, and the RD computational model provides an effective tool for the selection of parameters and generation of parameter sets.
PMCID: PMC3861511  PMID: 24349563
11.  Quantitative Identification of Mutant Alleles Derived from Lung Cancer in Plasma Cell-Free DNA via Anomaly Detection Using Deep Sequencing Data 
PLoS ONE  2013;8(11):e81468.
The detection of rare mutants using next generation sequencing has considerable potential for diagnostic applications. Detecting circulating tumor DNA is the foremost application of this approach. The major obstacle to its use is the high read error rate of next-generation sequencers. Rather than increasing the accuracy of final sequences, we detected rare mutations using a semiconductor sequencer and a set of anomaly detection criteria based on a statistical model of the read error rate at each error position. Statistical models were deduced from sequence data from normal samples. We detected epidermal growth factor receptor (EGFR) mutations in the plasma DNA of lung cancer patients. Single-pass deep sequencing (>100,000 reads) was able to detect one activating mutant allele in 10,000 normal alleles. We confirmed the method using 22 prospective and 155 retrospective samples, mostly consisting of DNA purified from plasma. A temporal analysis suggested potential applications for disease management and for therapeutic decision making to select epidermal growth factor receptor tyrosine kinase inhibitors (EGFR-TKI).
PMCID: PMC3836767  PMID: 24278442
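The abstract above does not spell out its anomaly detection criteria, so the following is only a generic sketch of the underlying idea: given a per-position read error rate estimated from normal samples, ask how surprising the observed number of mutant reads would be if they were all errors. It assumes a simple binomial error model; the error rate, read depth and counts are hypothetical and are not taken from the paper.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


depth = 100_000      # hypothetical read depth at one position
error_rate = 1e-4    # hypothetical error rate estimated from normal samples

# About 10 erroneous reads are expected at this depth and error rate.
p_background = binomial_tail(12, depth, error_rate)  # plausible as noise
p_candidate = binomial_tail(40, depth, error_rate)   # very unlikely as noise
print(p_background, p_candidate)
```

A position would be flagged as a candidate mutation only when the tail probability falls below a chosen threshold, which in practice would need correction for the number of positions tested.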
12.  Discovering Subgroups of Patients from DNA Copy Number Data Using NMF on Compacted Matrices 
PLoS ONE  2013;8(11):e79720.
In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decisions. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biological/clinical relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
PMCID: PMC3835832  PMID: 24278162
13.  Rule-Based Models of the Interplay between Genetic and Environmental Factors in Childhood Allergy 
PLoS ONE  2013;8(11):e80080.
Both genetic and environmental factors are important for the development of allergic diseases. However, a detailed understanding of how such factors act together is lacking. To elucidate the interplay between genetic and environmental factors in allergic diseases, we used a novel bioinformatics approach that combines feature selection and machine learning. In two materials, PARSIFAL (a European cross-sectional study of 3113 children) and BAMSE (a Swedish birth-cohort including 2033 children), genetic variants as well as environmental and lifestyle factors were evaluated for their contribution to allergic phenotypes. Monte Carlo feature selection and rule-based models were used to identify and rank rules describing how combinations of genetic and environmental factors affect the risk of allergic diseases. Novel interactions between genes were suggested and replicated, such as between ORMDL3 and RORA, where certain genotype combinations gave odds ratios for current asthma of 2.1 (95% CI 1.2-3.6) and 3.2 (95% CI 2.0-5.0) in the BAMSE and PARSIFAL children, respectively. Several combinations of environmental factors appeared to be important for the development of allergic disease in children. For example, use of baby formula and antibiotics early in life was associated with an odds ratio of 7.4 (95% CI 4.5-12.0) of developing asthma. Furthermore, genetic variants together with environmental factors seemed to play a role for allergic diseases, such as the use of antibiotics early in life and COL29A1 variants for asthma, and farm living and NPSR1 variants for allergic eczema. Overall, combinations of environmental and lifestyle factors appeared more frequently in the models than combinations solely involving genes. In conclusion, a new bioinformatics approach is described for analyzing complex data, including extensive genetic and environmental information. Interactions identified with this approach could provide useful hints for further in-depth studies of etiological mechanisms and may also strengthen the basis for risk assessment and prevention.
PMCID: PMC3833974  PMID: 24260339
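The odds ratios with 95% confidence intervals quoted in the abstract above are of the kind that can be obtained from a 2x2 table. The sketch below uses the standard log-odds (Woolf) approximation; the cell counts are hypothetical, and the abstract does not state which interval method the authors used.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed cases      b: exposed non-cases
    c: unexposed cases    d: unexposed non-cases
    All cells must be non-zero for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(round(or_value, 2), round(ci_low, 2), round(ci_high, 2))
```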
14.  The Chain Ratio Estimator and Regression Estimator with Linear Combination of Two Auxiliary Variables 
PLoS ONE  2013;8(11):e81085.
In sample surveys, it is usual to make use of auxiliary information to increase the precision of the estimators. We propose a new chain ratio estimator and regression estimator of a finite population mean using a linear combination of two auxiliary variables and obtain the mean squared error (MSE) equations for the proposed estimators. We find theoretical conditions that make the proposed estimators more efficient than the traditional multivariate ratio estimator and the regression estimator using information from two auxiliary variables.
PMCID: PMC3832417  PMID: 24260537
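For orientation, the sketch below implements the classical single-auxiliary-variable ratio and regression estimators of a population mean, which are the textbook building blocks behind estimators of this kind. It does not reproduce the chain ratio or two-variable estimators proposed in the paper, whose exact forms are not given in the abstract; the sample values and the known population mean of the auxiliary variable are hypothetical.

```python
from statistics import mean


def ratio_estimate(y, x, x_pop_mean):
    """Classical ratio estimator: y_bar * (X_bar / x_bar)."""
    return mean(y) * x_pop_mean / mean(x)


def regression_estimate(y, x, x_pop_mean):
    """Classical regression estimator: y_bar + b * (X_bar - x_bar)."""
    y_bar, x_bar = mean(y), mean(x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx  # least-squares slope of y on x
    return y_bar + slope * (x_pop_mean - x_bar)


# Hypothetical sample in which y is exactly twice x.
x_sample = [1.0, 2.0, 3.0]
y_sample = [2.0, 4.0, 6.0]
known_x_mean = 4.0  # assumed known population mean of the auxiliary variable

print(ratio_estimate(y_sample, x_sample, known_x_mean))
print(regression_estimate(y_sample, x_sample, known_x_mean))
```

Because the toy sample is perfectly proportional, both estimators return the same value; on real survey data they generally differ, and their relative efficiency depends on the correlation between the study and auxiliary variables.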
15.  Reconstruction and Analysis of Transcription Factor–miRNA Co-Regulatory Feed-Forward Loops in Human Cancers Using Filter-Wrapper Feature Selection 
PLoS ONE  2013;8(10):e78197.
As one of the most common types of co-regulatory motifs, feed-forward loops (FFLs) control many cell functions and play an important role in human cancers. Therefore, it is crucial to reconstruct and analyze cancer-related FFLs that are controlled by transcription factors (TFs) and microRNAs (miRNAs) simultaneously, in order to find out how miRNAs and TFs cooperate with each other in cancer cells and how they contribute to carcinogenesis. Current FFL studies rely on predicted regulation information and therefore suffer from false positives in their prediction results. More critically, FFLs generated by existing approaches cannot represent the dynamic and conditional regulation relationship under different experimental conditions.
Methodology/Principal Findings
In this study, we proposed a novel filter-wrapper feature selection method to accurately identify co-regulatory mechanisms by incorporating prior information from predicted regulatory interactions with parallel miRNA/mRNA expression datasets. By applying this method, we reconstructed 208 and 110 TF-miRNA co-regulatory FFLs from human pan-cancer and prostate datasets, respectively. Further analysis of these cancer-related FFLs showed that the top-ranking TF STAT3 and miRNA hsa-let-7e are key regulators implicated in human cancers, which have regulated targets significantly enriched in cellular process regulations and signaling pathways that are involved in carcinogenesis.
In this study, we introduced an efficient computational approach to reconstruct co-regulatory FFLs by accurately identifying gene co-regulatory interactions. The strength of the proposed feature selection method lies in the fact that it can precisely filter out false positives in predicted regulatory interactions by quantitatively modeling the complex co-regulation of target genes mediated by TFs and miRNAs simultaneously. Moreover, the proposed feature selection method can be generally applied to other gene regulation studies using parallel expression data with respect to different biological contexts.
PMCID: PMC3812136  PMID: 24205155
16.  A Kalman-Filter Based Approach to Identification of Time-Varying Gene Regulatory Networks 
PLoS ONE  2013;8(10):e74571.
Conventional identification methods for gene regulatory networks (GRNs) have overwhelmingly adopted static topology models, which remain unchanged over time, to represent the underlying molecular interactions of a biological system. However, GRNs are dynamic in response to physiological and environmental changes. Although there is a rich literature on modeling static or temporally invariant networks, how to systematically recover these temporally changing networks remains a major and pressing challenge. The purpose of this study is to suggest a two-step strategy that recovers time-varying GRNs.
This paper suggests using a switching auto-regressive model to describe the dynamics of time-varying GRNs, and proposes a two-step strategy to recover their structure. In the first step, the change points are detected by a Kalman-filter based method. The observed time series are divided into several segments using these detection results, and each time series segment lying between two successive change points is associated with an individual static regulatory network. In the second step, conditional network structure identification methods are used to reconstruct the topology for each time interval. This two-step strategy efficiently decouples the change point detection problem and the topology inference problem. Simulation results show that the proposed strategy can detect the change points precisely and recover each individual topology structure effectively. Moreover, computation results with the developmental data of Drosophila melanogaster show that the proposed change point detection procedure also works effectively in real world applications and that its change point estimation accuracy exceeds that of other existing approaches, which means the suggested strategy may also be helpful in solving actual GRN reconstruction problems.
PMCID: PMC3792119  PMID: 24116005
17.  Activation of STAT3 in Human Gastric Cancer Cells via Interleukin (IL)-6-Type Cytokine Signaling Correlates with Clinical Implications 
PLoS ONE  2013;8(10):e75788.
The signal transducer and activator of transcription 3 (STAT3) signaling pathway plays important roles in oncogenesis, angiogenesis, immunity, and tumor cell invasion. In the present study, we investigated the association of the interleukin (IL)-6/STAT3 signaling pathway with T lymphocytes and its clinical implications in patients with gastric cancer.
Seventy-one patients who underwent gastrectomy due to gastric adenocarcinoma were studied. Blood samples were collected before and after surgical gastrectomy to quantify the levels of IL-6, IL-10 and VEGF using an enzyme-linked immunosorbent assay, as well as T lymphocyte subsets (CD3+, CD4+, CD8+, CD4+/CD8+) and natural killer (NK) cells by flow cytometry. Furthermore, the expression of IL-6, survivin, STAT3, STAT3 phosphorylation (p-STAT3), and VEGF was determined in human gastric cancer and adjacent normal mucosa through Western blot and immunohistochemistry.
Postoperative levels of IL-6, IL-10 and VEGF in serum were significantly lower than preoperative levels. Percentages of T-cell subsets and NK cells in blood were significantly increased at postoperative week 1 compared with preoperative values, and were further augmented at 1 month after gastrectomy. In addition, the expression of IL-6, survivin, STAT3, p-STAT3, and VEGF was increased in human gastric cancer tissues as compared to adjacent normal mucosa. Their expression was associated with TNM stage of gastric cancer. The level of STAT3 activation in clinical samples was correlated with IL-6 expression. All gastric tumor samples, which expressed p-STAT3, also expressed IL-6 with weak expression detected in adjacent normal mucosa.
Increased IL-6-induced activation of STAT3 was observed in neoplastic gastric tissue, which positively correlated with tumor progression. Moreover, IL-6 and STAT3 downstream signals such as IL-10 and VEGF were reduced in patients after removal of gastric cancer compared with preoperative levels. Therefore, inhibition of the IL-6/STAT3 signaling pathway may provide a new therapeutic strategy against gastric cancer.
PMCID: PMC3792128  PMID: 24116074
18.  Periodontal Disease and the Oral Microbiota in New-Onset Rheumatoid Arthritis 
Arthritis and Rheumatism  2012;64(10):3083-3094.
To profile the subgingival oral microbiota abundance and diversity in never-treated, new-onset rheumatoid arthritis (NORA) patients.
Periodontal disease (PD) status, clinical activity and sociodemographic factors were determined in patients with NORA, chronic RA (CRA) and healthy subjects. Massively parallel pyrosequencing was used to compare the composition of subgingival microbiota and establish correlations between presence/abundance of bacteria and disease phenotypes. Anti-P. gingivalis antibodies were tested to assess prior exposure.
The more advanced forms of periodontitis are already present at disease onset in NORA patients. The subgingival microbiota of NORA is distinct from controls. In most cases, however, these differences can be attributed to PD severity and are not inherent to RA. The presence and abundance of P. gingivalis is directly associated with PD severity as well, is not unique to RA, and does not correlate with anti-citrullinated peptide antibody (ACPA) titers. Overall exposure to P. gingivalis is similar in RA and controls, observed in 78.4% and 83.3%, respectively. Anaeroglobus geminatus correlated with ACPA/RF presence. Prevotella and Leptotrichia species are the only characteristic taxa in the NORA group irrespective of PD status.
NORA patients exhibit a high prevalence of PD at disease onset, despite their young age and paucity of smoking history. The subgingival microbiota of NORA patients is similar to CRA and healthy subjects of comparable PD severity. Although colonization with P. gingivalis correlates with PD severity, overall exposure is similar among groups. The role of A. geminatus and Prevotella/Leptotrichia species in this process merits further study.
PMCID: PMC3428472  PMID: 22576262
19.  Variant Callers for Next-Generation Sequencing Data: A Comparison Study 
PLoS ONE  2013;8(9):e75619.
Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied the pipelines to whole exome sequencing data taken from 20 individuals. We obtained genotypes generated by Illumina Infinium HumanExome v1.1 Beadchip for validation analysis and then used Sanger sequencing as a “gold-standard” method to resolve discrepancies for selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using known simulated whole genome sequence data as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers and the pairwise overlapping rate was about 0.9. Compared with other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio out of GATK was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing versus the exome array were more accurate, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general purpose NGS analyses. 
The GATK pipelines we developed perform very well.
PMCID: PMC3785481  PMID: 24086590
20.  Voltage Affects the Dissociation Rate Constant of the m2 Muscarinic Receptor 
PLoS ONE  2013;8(9):e74354.
G-protein coupled receptors (GPCRs) comprise the largest protein family and mediate the vast majority of signal transduction processes in the body. Until recently GPCRs were not considered to be voltage dependent. It was recently shown for several GPCRs that the first step in GPCR activation, the binding of agonist to the receptor, is voltage sensitive: voltage shifts the receptor between two states that differ in their binding affinity. Here we show that this shift involves the rate constant of dissociation. We used the m2 muscarinic receptor (m2R), a prototypical GPCR, and directly measured the dissociation of [3H]ACh from m2R expressed in Xenopus oocytes. We show, for the first time, that the voltage-dependent change in affinity is implemented by voltage shifting the receptor between two states that differ in their rate constant of dissociation. Furthermore, we provide evidence suggesting that this shift is achieved by voltage regulating the coupling of the GPCR to its G protein.
PMCID: PMC3760861  PMID: 24019965
21.  Correlation Analysis Connects Cancer Subtypes 
PLoS ONE  2013;8(7):e69747.
We provided a cross-tissue comparative analysis of between-subtype molecular commonality for ovarian cancer, breast cancer, hepatocellular carcinoma, glioma, lung squamous carcinoma and nasopharyngeal carcinoma. Our analysis showed that molecular subtypes with similar phenotypes or similar clinical outcomes could be correlated by their transcriptional profiles and pathway profiles. Pathway dysregulation across multiple cancer subtypes was also revealed by Gene Set Enrichment Analysis. Dysregulation of ‘complement and coagulation cascades’ was observed in a total of eleven subtypes across five tissues, suggesting that the role of this process in personalized immune-based therapy may be worth exploring further.
PMCID: PMC3704535  PMID: 23861980
22.  Modelling Competing Endogenous RNA Networks 
PLoS ONE  2013;8(6):e66609.
MicroRNAs (miRNAs) are small RNA molecules, about 22 nucleotides long, which post-transcriptionally regulate their target messenger RNAs (mRNAs). They play key roles in gene regulatory networks, ranging from signaling pathways to tissue morphogenesis, and their aberrant behavior is often associated with the development of various diseases. Recently it has been experimentally shown that the way miRNAs interact with their targets can be described in terms of a titration mechanism. From a theoretical point of view, titration mechanisms are characterized by a threshold effect at near-equimolarity of the different chemical species, hypersensitivity of the system around the threshold, and cross-talk among targets. The latter characteristic has recently been termed the competing endogenous RNA (ceRNA) effect, denoting the indirect interactions among targets that compete for a common pool of miRNAs. Here we propose a stochastic model to analyze the equilibrium and out-of-equilibrium properties of a network of miRNAs interacting with mRNA targets. In particular, we are able to describe in detail the peculiar equilibrium and non-equilibrium phenomena that the system displays in proximity to the threshold: (i) maximal cross-talk and correlation between targets, (ii) robustness of the ceRNA effect with respect to the model's parameters and in particular to the catalyticity of the miRNA-mRNA interaction, and (iii) anomalous response time to external perturbations.
PMCID: PMC3694070  PMID: 23840508
23.  The miR-17-92 Cluster and its Target THBS1 are Differentially Expressed in Angiosarcomas Dependent on MYC Amplification 
Genes, chromosomes & cancer  2012;51(6):569-578.
Angiosarcomas (AS) represent a heterogeneous group of malignant vascular tumors that may occur spontaneously as primary tumors or secondarily after radiation therapy or in the context of chronic lymphedema. Most secondary AS have been associated with MYC oncogene amplification, while the role of MYC abnormalities in primary AS is not well defined. Twenty-two primary and secondary AS were analyzed by array-comparative genomic hybridization (aCGH) and by deep sequencing of small RNA libraries. By aCGH, subsequently confirmed by FISH, MYC amplification was identified in three of six primary tumors and in eight of 12 secondary AS. We also identified MAML1 as a new potential oncogene in MYC-amplified AS. Significant up-regulation of the miR-17-92 cluster was observed in MYC-amplified AS compared to AS lacking MYC amplification and the control group (other vascular tumors, non-vascular sarcomas). Moreover, MYC-amplified AS were associated with a significantly lower expression of thrombospondin-1 (THBS1) than AS without MYC amplification or controls. Altogether, our study implicates MYC amplification not only in the pathogenesis of secondary AS but also in a subset of primary AS. Thus, MYC amplification may play a crucial role in the angiogenic phenotype of AS through up-regulation of the miR-17-92 cluster, which subsequently downregulates THBS1, a potent endogenous inhibitor of angiogenesis.
PMCID: PMC3360479  PMID: 22383169
24.  Identifying Gene Set Association Enrichment Using the Coefficient of Intrinsic Dependence 
PLoS ONE  2013;8(3):e58851.
The gene set testing problem has become a focus of microarray data analysis. A gene set is a group of genes that are defined by a priori biological knowledge. Several statistical methods have been proposed to determine whether functional gene sets are differentially expressed (enrichment and/or deletion) across phenotypes. However, little attention has been given to analyzing the dependence structure among gene sets. In this study, we propose a novel statistical method of gene set association analysis to identify significantly associated gene sets using the coefficient of intrinsic dependence. The simulation studies show that the proposed method outperforms conventional methods in detecting general forms of association in terms of control of type I error and power. The coefficient of intrinsic dependence has been applied to a breast cancer microarray dataset to quantify the unsupervised relationship between two sets of genes in the tumor and non-tumor samples. It was observed that the existence of gene-set association differed across various clinical cohorts. In addition, supervised learning was employed to illustrate how gene sets, in signal transduction pathways or subnetworks regulated by a set of transcription factors, can be discovered using microarray data. In conclusion, the coefficient of intrinsic dependence provides a powerful tool for detecting general types of association. Hence, it can be useful for associating gene sets using microarray expression data. By connecting relevant gene sets, our approach has the potential to reveal underlying associations by drawing a statistically relevant network in a given population, and it can also be used to complement conventional gene set analysis.
PMCID: PMC3597597  PMID: 23516564
25.  Identification of a Novel, Recurrent HEY1-NCOA2 Fusion in Mesenchymal Chondrosarcoma based on a Genome-wide Screen of Exon-level Expression Data 
Genes, chromosomes & cancer  2011;51(2):127-139.
Cancer gene fusions that encode a chimeric protein are often characterized by an intragenic discontinuity in the RNA expression levels of the exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners due to differences in the levels of activation of their respective promoters. Based on this, we developed an unbiased, genome-wide bioinformatic screen for gene fusions using Affymetrix Exon array expression data. Using a training set of 46 samples with different known gene fusions, we developed a data analysis pipeline, the “Fusion Score (FS) model”, to score and rank genes for intragenic changes in expression. In a separate discovery set of 41 tumor samples with possible unknown gene fusions, the FS model generated a list of 552 candidate genes. The transcription factor gene NCOA2 was one of the candidates identified in a mesenchymal chondrosarcoma. A novel HEY1-NCOA2 fusion was identified by 5′ RACE, representing an in-frame fusion of HEY1 exon 4 to NCOA2 exon 13. RT-PCR or FISH evidence of this HEY1-NCOA2 fusion was present in all additional mesenchymal chondrosarcomas tested with a definitive histologic diagnosis and adequate material for analysis (n=9) but was absent in 15 samples of other subtypes of chondrosarcomas. We also identified a NUP107-LGR5 fusion in a dedifferentiated liposarcoma but analysis of 17 additional samples did not confirm it as a recurrent event in this sarcoma type. The novel HEY1-NCOA2 fusion appears to be the defining and diagnostic gene fusion in mesenchymal chondrosarcomas.
PMCID: PMC3235801  PMID: 22034177

