1.  Nonparametric Tests for Differential Histone Enrichment with ChIP-Seq Data 
Cancer Informatics  2015;14(Suppl 1):11-22.
Chromatin immunoprecipitation sequencing (ChIP-seq) is a powerful method for analyzing protein interactions with DNA. It can be applied to identify the binding sites of transcription factors (TFs) and the genomic landscape of histone modification marks (HMs). Previous research has largely focused on developing peak-calling procedures to detect the binding sites for TFs. However, these procedures may fail when applied to ChIP-seq data of HMs, which have diffuse signals and multiple local peaks. In addition, it is important to identify genes with differential histone enrichment regions between two experimental conditions, such as different cellular states or different time points. Parametric methods based on Poisson/negative binomial distribution have been proposed to address this differential enrichment problem and most of these methods require biological replications. However, many ChIP-seq data sets have few or even no replicates. We propose a nonparametric method to identify the genes with differential histone enrichment regions even without replicates. Our method is based on nonparametric hypothesis testing and kernel smoothing in order to capture the spatial differences in histone-enriched profiles. We demonstrate the method using ChIP-seq data on a comparative epigenomic profiling of adipogenesis of murine adipose stromal cells and the Encyclopedia of DNA Elements (ENCODE) ChIP-seq data. Our method identifies many genes with differential H3K27ac histone enrichment profiles at gene promoter regions between proliferating preadipocytes and mature adipocytes in murine 3T3-L1 cells. The test statistics also correlate well with gene expression changes and are predictive of them, indicating that the identified differentially enriched regions are indeed biologically meaningful.
PMCID: PMC4310510  PMID: 25657574
kernel smoothing; normalization; nonparametric testing; spatial histone profiles
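As a rough illustration of the kernel smoothing step mentioned in the abstract above, the sketch below smooths binned read counts with a Gaussian kernel (a Nadaraya-Watson estimate). It is not the authors' procedure: the function name `gaussian_kernel_smooth`, the choice of a Gaussian kernel, and the `bandwidth` parameter are all assumptions made for this example.

```python
from math import exp


def gaussian_kernel_smooth(positions, values, bandwidth):
    """Nadaraya-Watson smoothing of `values` observed at `positions`.

    Hypothetical helper for illustration only; the kernel and the
    bandwidth are assumptions, not taken from the paper.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if len(positions) != len(values):
        raise ValueError("positions and values must have the same length")

    smoothed = []
    for x0 in positions:
        # Gaussian weight of every observation relative to the target point.
        weights = [exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in positions]
        total = sum(weights)  # always >= 1 because x0 itself has weight 1
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed


# Example: an isolated spike of reads in the middle bin is spread over
# its neighbours, giving a smooth enrichment profile.
bins = [0, 1, 2, 3, 4]
counts = [0, 0, 10, 0, 0]
profile = gaussian_kernel_smooth(bins, counts, bandwidth=1.0)
```

Smoothed profiles of this kind could then be compared between two conditions; how the comparison is turned into a test statistic is specific to the paper and is not reproduced here.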
2.  Systems Biology Approaches to Epidemiological Studies of Complex Diseases 
Systems biology approaches to epidemiological studies of complex diseases include collection of genetic, genomic, epigenomic and metagenomic data in large-scale epidemiological studies of complex phenotypes. Designs and analyses of such studies raise many statistical challenges. This paper reviews some issues related to integrative analysis of such high dimensional and inter-related data sets and outlines some possible solutions. I focus my review on integrative approaches for genome-wide genetic variants and gene expression data, methods for joint analysis of genetic and epigenetic variants and methods for analysis of microbiome data. Statistical methods such as mediation analysis, high dimensional instrumental variable regression, sparse signal recovery and compositional data regression provide potential frameworks for integrative analysis of these high dimensional genomic data.
PMCID: PMC3947451  PMID: 24019288
3.  Low/Negative Expression of PDGFR-α Identifies the Candidate Primary Mesenchymal Stromal Cells in Adult Human Bone Marrow 
Stem Cell Reports  2014;3(6):965-974.
Human bone marrow (BM) contains a rare population of nonhematopoietic mesenchymal stromal cells (MSCs), which are of central importance for the hematopoietic microenvironment. However, the precise phenotypic definition of these cells in adult BM has not yet been reported. In this study, we show that low/negative expression of CD140a (PDGFR-α) on lin−/CD45−/CD271+ BM cells identified a cell population with very high MSC activity, measured as fibroblastic colony-forming unit frequency and typical in vitro and in vivo stroma formation and differentiation capacities. Furthermore, these cells exhibited high levels of genes associated with mesenchymal lineages and HSC supportive function. Moreover, lin−/CD45−/CD271+/CD140alow/− cells effectively mediated the ex vivo expansion of transplantable CD34+ hematopoietic stem cells. Taken together, these data indicate that CD140a is a key negative selection marker for adult human BM-MSCs, which enables the prospective isolation of a close-to-pure population of candidate human adult stroma stem/progenitor cells with potent hematopoiesis-supporting capacity.
• Comparative gene expression profiling identified MSC markers
• Primary adult bone marrow MSCs are CD140 (PDGFR-α) low/negative
• CD140alow/− cells have typical in vitro and in vivo MSC properties
• Coculture with CD140alow/− cells effectively expanded transplantable CD34+ HSCs
Scheding and colleagues report that low/negative expression of PDGFR-α on lin−/CD45−/CD271+ bone marrow cells identified a cell population with very high CFU-F activity, typical in vitro and in vivo MSC properties, and HSC supportive function. These data indicate that PDGFR-α is a key marker for adult human BM-MSCs, which are critical for the definition of the putative stroma stem cells.
PMCID: PMC4264066  PMID: 25454633
4.  Sodium-Glucose Transporter-2 (SGLT2; SLC5A2) Enhances Cellular Uptake of Aminoglycosides 
PLoS ONE  2014;9(9):e108941.
Aminoglycoside antibiotics, like gentamicin, continue to be clinically essential worldwide to treat life-threatening bacterial infections. Yet, the ototoxic and nephrotoxic side-effects of these drugs remain serious complications. A major site of gentamicin uptake and toxicity resides within kidney proximal tubules that also heavily express electrogenic sodium-glucose transporter-2 (SGLT2; SLC5A2) in vivo. We hypothesized that SGLT2 traffics gentamicin, and promotes cellular toxicity. We confirmed in vitro expression of SGLT2 in proximal tubule-derived KPT2 cells, and absence in distal tubule-derived KDT3 cells. D-glucose competitively decreased the uptake of 2-(N-(7-nitrobenz-2-oxa-1,3-diazol-4-yl)amino)-2-deoxyglucose (2-NBDG), a fluorescent analog of glucose, and fluorescently-tagged gentamicin (GTTR) by KPT2 cells. Phlorizin, an SGLT2 antagonist, strongly inhibited uptake of 2-NBDG and GTTR by KPT2 cells in a dose- and time-dependent manner. GTTR uptake was elevated in KDT3 cells transfected with SGLT2 (compared to controls); and this enhanced uptake was attenuated by phlorizin. Knock-down of SGLT2 expression by siRNA reduced gentamicin-induced cytotoxicity. In vivo, SGLT2 was robustly expressed in kidney proximal tubule cells of heterozygous, but not null, mice. Phlorizin decreased GTTR uptake by kidney proximal tubule cells in Sglt2+/− mice, but not in Sglt2−/− mice. However, serum GTTR levels were elevated in Sglt2−/− mice compared to Sglt2+/− mice, and in phlorizin-treated Sglt2+/− mice compared to vehicle-treated Sglt2+/− mice. Loss of SGLT2 function by antagonism or by gene deletion did not affect gentamicin cochlear loading or auditory function. Phlorizin did not protect wild-type mice from kanamycin-induced ototoxicity. We conclude that SGLT2 can traffic gentamicin and contribute to gentamicin-induced cytotoxicity.
PMCID: PMC4182564  PMID: 25268124
5.  Primary mesenchymal stem cells in human transplanted lungs are CD90/CD105 perivascularly located tissue-resident cells 
BMJ Open Respiratory Research  2014;1(1):e000027.
Mesenchymal stem cells (MSC) have not only been implicated in the development of lung diseases, but they have also been proposed as a future cell-based therapy for lung diseases. However, the cellular identity of the primary MSC in human lung tissues has not yet been reported. This study therefore aimed to identify and characterise the ‘bona fide’ MSC in human lungs and to investigate if the MSC numbers correlate with the development of bronchiolitis obliterans syndrome in lung-transplanted patients.
Primary lung MSC were directly isolated or culture-derived from central and peripheral transbronchial biopsies of lung-transplanted patients and evaluated using a comprehensive panel of in vitro and in vivo assays.
Primary MSC were enriched in the CD90/CD105 mononuclear cell fraction with mesenchymal progenitor frequencies of up to four colony-forming units, fibroblast/100 cells. In situ staining of lung tissues revealed that CD90/CD105 MSCs were located perivascularly. MSC were tissue-resident and exclusively donor lung-derived even in biopsies obtained from patients as long as 16 years after transplantation. Culture-derived mesenchymal stromal cells showed typical in vitro MSC properties; however, xenotransplantation into non-obese diabetic/severe combined immunodeficient (NOD/SCID) mice showed that lung MSC readily differentiated into adipocytes and stromal tissues, but lacked significant in vivo bone formation.
These data clearly demonstrate that primary MSC in human lung tissues are not only tissue resident but also tissue-specific. The identification and phenotypic characterisation of primary lung MSC is an important first step in identifying the role of MSC in normal lung physiology and pulmonary diseases.
PMCID: PMC4212711  PMID: 25478178
Lung Transplantation
6.  A Hierarchical Bayesian Model for Estimating and Inferring Differential Isoform Expression for Multi-Sample RNA-Seq Data 
Statistics in biosciences  2011;5(1):119-137.
RNA-Seq has drastically changed our ways of studying transcriptomes in providing more precise estimates of gene expression, including isoform-specific expression. Most of the available methods for RNA-Seq data focus on one sample at a time. We present in this paper a Poisson-Gamma hierarchical model for multi-sample RNA-Seq data analysis in order to simultaneously estimate isoform-specific expression and to identify differentially expressed isoforms. Our model has the advantage of borrowing information across all samples in estimating expression levels, which can improve the estimates drastically, particularly for low abundance isoforms. Furthermore, our hierarchical model has the ability to account for overdispersion in the data and also can incorporate sample-specific covariates in the underlying model, which facilitates the isoform-specific differential expression analysis. Simulation studies demonstrated that this Bayesian multi-sample approach can lead to more precise estimates of isoform-specific expression and higher power to detect differential expression by borrowing information across all samples than single sample analysis, especially for isoforms of low abundance. We further illustrated our methods using the RNA-Seq data of 10 Yoruban and 10 Caucasian individuals.
PMCID: PMC3669631  PMID: 23737925
Mixture of Poisson-Gamma model; Markov Chain Monte Carlo Sampling; Next Generation Sequencing
7.  Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis 
Biostatistics (Oxford, England)  2012;14(2):244-258.
Motivated by studying the association between nutrient intake and human gut microbiome composition, we developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting. ssCCA takes into account the phylogenetic relationships among bacteria, which provides important prior knowledge on evolutionary relationships among bacterial taxa. Our ssCCA formulation utilizes a phylogenetic structure-constrained penalty function to impose certain smoothness on the linear coefficients according to the phylogenetic relationships among the taxa. An efficient coordinate descent algorithm is developed for optimization. A human gut microbiome data set is used to illustrate this method. Both simulations and real data applications show that ssCCA performs better than the standard sparse CCA in identifying meaningful variables when there are structures in the data.
PMCID: PMC3590923  PMID: 23074263
Dimension reduction; Graph; Phylogenetic tree; Regularization; Variable selection
8.  Adjusting for High-dimensional Covariates in Sparse Precision Matrix Estimation by ℓ1-Penalization 
Journal of multivariate analysis  2013;116:10.1016/j.jmva.2013.01.005.
Motivated by the analysis of genetical genomic data, we consider the problem of estimating a high-dimensional sparse precision matrix adjusting for possibly a large number of covariates, where the covariates can affect the mean value of the random vector. We develop a two-stage estimation procedure to first identify the relevant covariates that affect the means by a joint ℓ1 penalization. The estimated regression coefficients are then used to estimate the mean values in a multivariate sub-Gaussian model in order to estimate the sparse precision matrix through an ℓ1-penalized log-determinant Bregman divergence. Under the multivariate normal assumption, the precision matrix has the interpretation of a conditional Gaussian graphical model. We show that under some regularity conditions, the estimates of the regression coefficients are consistent in element-wise ℓ∞ norm, Frobenius norm and also spectral norm even when p ≫ n and q ≫ n. We also show that with probability converging to one, the estimate of the precision matrix correctly specifies the zero pattern of the true precision matrix. We illustrate our theoretical results via simulations and demonstrate that the method can lead to an improved estimate of the precision matrix. We apply the method to an analysis of a yeast genetical genomic data set.
PMCID: PMC3653344  PMID: 23687392
Estimation bounds; Graphical Model; Model selection consistency; Oracle property
The annals of applied statistics  2013;7(1):10.1214/12-AOAS592.
With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output is bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of the covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group ℓ1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.
PMCID: PMC3846354  PMID: 24312162
Coordinate descent; Counts data; Overdispersion; Regularized likelihood; Sparse group penalty
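To make the Dirichlet-multinomial model in the abstract above concrete, the following sketch evaluates the standard DM log-likelihood of one sample of taxa counts for a given parameter vector. The function name `dm_log_likelihood` is hypothetical, and the sketch covers only the likelihood itself; the covariate link, the sparse group penalty and the block-coordinate descent algorithm from the paper are not reproduced.

```python
from math import lgamma


def dm_log_likelihood(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-multinomial model.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive DM parameters, one per taxon. In a regression
            setting these would depend on covariates; that link is
            left out of this illustrative helper.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha values must be positive")
    if any(x < 0 for x in counts):
        raise ValueError("counts must be non-negative")

    n = sum(counts)
    alpha_sum = sum(alpha)

    # log of the multinomial coefficient n! / prod(x_j!)
    log_lik = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log of Gamma(alpha_sum) / Gamma(alpha_sum + n)
    log_lik += lgamma(alpha_sum) - lgamma(alpha_sum + n)
    # log of prod Gamma(alpha_j + x_j) / Gamma(alpha_j)
    log_lik += sum(lgamma(a + x) - lgamma(a) for x, a in zip(counts, alpha))
    return log_lik


# With two taxa and alpha = (1, 1), the count of the first taxon is
# uniform on {0, 1, 2}, so each outcome has probability 1/3.
example = dm_log_likelihood([1, 1], [1.0, 1.0])
```

Smaller values of the summed parameters give more overdispersion relative to a plain multinomial, which is the property the abstract relies on.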
Statistica Sinica  2011;21(4):1515-1540.
Nonparametric varying coefficient models are useful for studying the time-dependent effects of variables. Many procedures have been developed for estimation and variable selection in such models. However, existing work has focused on the case when the number of variables is fixed or smaller than the sample size. In this paper, we consider the problem of variable selection and estimation in varying coefficient models in sparse, high-dimensional settings when the number of variables can be larger than the sample size. We apply the group Lasso and basis function expansion to simultaneously select the important variables and estimate the nonzero varying coefficient functions. Under appropriate conditions, we show that the group Lasso selects a model of the right order of dimensionality, selects all variables with the norms of the corresponding coefficient functions greater than certain threshold level, and is estimation consistent. However, the group Lasso is in general not selection consistent and tends to select variables that are not important in the model. In order to improve the selection results, we apply the adaptive group Lasso. We show that, under suitable conditions, the adaptive group Lasso has the oracle selection property in the sense that it correctly selects important variables with probability converging to one. In contrast, the group Lasso does not possess such oracle property. Both approaches are evaluated using simulation and demonstrated on a data example.
PMCID: PMC3902862  PMID: 24478564
Basis expansion; group Lasso; high-dimensional data; non-parametric coefficient function; selection consistency; sparsity
The annals of applied statistics  2013;7(3):1249-1835.
Multivariate time series (MTS) data such as time course gene expression data in genomics are often collected to study the dynamic nature of the systems. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing the conditional independence. We present the asymptotic distribution of the CLRT statistic and show that the tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing time course gene expression data related to mouse T-cell activation.
PMCID: PMC3898602  PMID: 24465291
Bayesian network; Composite likelihood ratio test; Genetic network; PCD-PCD algorithm
12.  Network-based Analysis of Multivariate Gene Expression Data 
Multivariate microarray gene expression data are commonly collected to study the genomic responses under ordered conditions such as over increasing/decreasing dose levels or over time during biological processes, where the expression levels of a given gene are expected to be dependent. One important question from such multivariate gene expression experiments is to identify genes that show different expression patterns over treatment dosages or over time; these genes can also point to the pathways that are perturbed during a given biological process. Several empirical Bayes approaches have been developed for identifying the differentially expressed genes in order to account for the parallel structure of the data and to borrow information across all the genes. However, these methods assume that the genes are independent. In this paper, we introduce an alternative empirical Bayes approach for analysis of multivariate gene expression data by assuming a discrete Markov random field (MRF) prior, where the dependency of the differential expression patterns of genes on the networks is modeled by a Markov random field. Simulation studies indicated that the method is quite effective in identifying genes and the modified subnetworks and has higher sensitivity than the commonly used procedures that do not use the pathway information, with similar observed false discovery rates. We applied the proposed methods for analysis of a microarray time course gene expression study of TrkA- and TrkB-transfected neuroblastoma cell lines and identified genes and subnetworks on MAPK, focal adhesion and prion disease pathways that may explain cell differentiation in TrkA-transfected cell lines.
PMCID: PMC3692268  PMID: 23385535
Markov random field; empirical Bayes; KEGG pathways
13.  Simultaneous Discovery of Rare and Common Segment Variants 
Biometrika  2012;100(1):157-172.
Copy number variants are an important type of genetic structural variation appearing in germline DNA, ranging from common to rare in a population. Both rare and common copy number variants have been reported to be associated with complex diseases, so it is important to simultaneously identify both based on a large set of population samples. We develop a proportion adaptive segment selection procedure that automatically adjusts to the unknown proportions of the carriers of the segment variants. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected. Although the detection boundaries are very different for the rare and common segment variants, it is shown that the proposed procedure can reliably identify both whenever they are detectable. Compared with methods for single sample analysis, this procedure gains power by pooling information from multiple samples. The method is applied to analyze neuroblastoma samples and identifies a large number of copy number variants that are missed by single-sample methods.
PMCID: PMC3696347  PMID: 23825436
DNA copy number variant; Information pooling; Population structural variant
14.  PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read distribution 
Nucleic Acids Research  2013;42(3):e20.
Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis, and if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by the use of a non-parametric approach. Our rationale is that regardless of what factors lead to non-uniformity, whether it is due to hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth, and using two real Illumina RNA-Seq data sets including one with quantitative real time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at
PMCID: PMC3919567  PMID: 24362841
15.  Robust Gaussian Graphical Modeling via l1 Penalization 
Biometrics  2012;68(4):1197-1206.
Gaussian graphical models have been widely used as an effective method for studying the conditional independency structure among genes and for constructing genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution. Such outliers in gene expression data can lead to wrong inference on the dependency structure among the genes. We propose an l1-penalized estimation procedure for the sparse Gaussian graphical models that is robustified against possible outliers. The likelihood function is weighted according to how far each observation deviates, where the deviation of the observation is measured based on its own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified-likelihood, where nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that the proposed robust method performs much better than the graphical Lasso for the Gaussian graphical models in terms of both graph structure selection and estimation when outliers are present. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has better biological interpretation than that obtained from the graphical Lasso.
PMCID: PMC3535542  PMID: 23020775
Coordinate descent algorithm; Genetic Network; Iterative proportional fitting; Outliers; Penalized likelihood
16.  Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis 
Copy number variants (CNVs) are alterations of the DNA of a genome that result in the cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
PMCID: PMC3563068  PMID: 23393425
Robust segment detector; Robust segment identifier; optimality; DNA copy number variant; next generation sequencing data
17.  Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat, promotes atherosclerosis 
Nature medicine  2013;19(5):576-585.
Intestinal microbiota metabolism of choline/phosphatidylcholine produces trimethylamine (TMA), which is further metabolized to a proatherogenic species, trimethylamine-N-oxide (TMAO). Herein we demonstrate that intestinal microbiota metabolism of dietary L-carnitine, a trimethylamine abundant in red meat, also produces TMAO and accelerates atherosclerosis. Omnivorous subjects are shown to produce significantly more TMAO than vegans/vegetarians following ingestion of L-carnitine through a microbiota-dependent mechanism. Specific bacterial taxa in human feces are shown to associate with both plasma TMAO and dietary status. Plasma L-carnitine levels in subjects undergoing cardiac evaluation (n = 2,595) predict increased risks for both prevalent cardiovascular disease (CVD) and incident major adverse cardiac events (MI, stroke or death), but only among subjects with concurrently high TMAO levels. Chronic dietary L-carnitine supplementation in mice significantly altered cecal microbial composition, markedly enhanced synthesis of TMA/TMAO, and increased atherosclerosis, but not following suppression of intestinal microbiota. Dietary supplementation of TMAO, or either carnitine or choline in mice with intact intestinal microbiota, significantly reduced reverse cholesterol transport in vivo. Intestinal microbiota may thus participate in the well-established link between increased red meat consumption and CVD risk.
PMCID: PMC3650111  PMID: 23563705
18.  U-statistics in Genetic Association Studies 
Human genetics  2012;131(9):1395-1401.
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from the genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants only explain very small percentages of the heritabilities. To account for the locus- and allelic-heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms (SNPs) and the phenotypes. Compared to single marker analysis, the advantage afforded by the U-statistics-based methods is large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next generation sequencing data and rare variant association studies are discussed.
PMCID: PMC3419299  PMID: 22610525
19.  MATCHCLIP: locate precise breakpoints for copy number variation using CIGAR string by matching soft clipped reads 
Frontiers in Genetics  2013;4:157.
Copy number variations (CNVs) are associated with many complex diseases. Next generation sequencing data enable one to identify precise CNV breakpoints to better understand the underlying molecular mechanisms and to design more efficient assays. Using the CIGAR strings of the reads, we develop a method that can identify the exact CNV breakpoints, and in cases when the breakpoints are in a repeated region, the method reports a range where the breakpoints can slide. Our method identifies the breakpoints of a CNV using both the positions and CIGAR strings of the reads that cover breakpoints of a CNV. A read with a long soft clipped part (denoted as S in CIGAR) at its 3′ (right) end can be used to identify the 5′ (left) side of the breakpoints, and a read with a long S part at the 5′ end can be used to identify the breakpoint at the 3′ side. To ensure both types of reads cover the same CNV, we require the overlapped common string to include both of the soft clipped parts. When a CNV starts and ends in the same repeated regions, its breakpoints are not unique, in which case our method reports the leftmost positions for the breakpoints and a range within which the breakpoints can be incremented without changing the variant sequence. We have implemented the methods in a C++ package intended for the current Illumina Miseq and Hiseq platforms for both whole genome and exon-sequencing. Our simulation studies have shown that our method compares favorably with other similar methods in terms of true discovery rate, false positive rate and breakpoint accuracy. Our results from a real application have shown that the detected CNVs are consistent with zygosity and read depth information. The software package is available at
PMCID: PMC3744852  PMID: 23967014
structural variation; breakpoint; duplication; deletion; exon sequencing
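The use of soft clipped reads described in the abstract above can be sketched as follows. This is a minimal illustration in Python, not the MATCHCLIP C++ implementation: the function names `parse_cigar` and `soft_clip_side` and the `min_clip` threshold of 20 bases are assumptions made for the example, and the matching of overlapping clipped sequences between reads is not covered.

```python
import re

# One CIGAR operation: a run length followed by an operation code,
# using the operation letters defined in the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '60M40S' into [(60, 'M'), (40, 'S')]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to reject input the pattern did not fully consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def soft_clip_side(cigar, min_clip=20):
    """Return 'right', 'left' or None for a long soft clip on a read.

    'right' means the 3' (right) end is soft clipped, which per the
    abstract can point to the left-side breakpoint; 'left' means the
    5' (left) end is clipped, pointing to the right-side breakpoint.
    `min_clip` is a hypothetical threshold for what counts as "long".
    """
    # Hard clips (H) can flank soft clips, so ignore them here.
    ops = [(length, op) for length, op in parse_cigar(cigar) if op != "H"]
    if not ops:
        return None
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if ops[-1][1] == "S" else 0
    if right >= min_clip and right >= left:
        return "right"
    if left >= min_clip:
        return "left"
    return None


# Example reads: one clipped at the right end, one at the left end.
sides = [soft_clip_side(c) for c in ("60M40S", "35S65M", "100M")]
```

A read pair where one read reports `'right'` and another reports `'left'` near the same region would be a candidate for the overlap check the abstract describes.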
20.  A Gaussian copula approach for the analysis of secondary phenotypes in case–control genetic association studies 
Biostatistics (Oxford, England)  2011;13(3):497-508.
In many case–control genetic association studies, a set of correlated secondary phenotypes that may share common genetic factors with disease status are collected. Examination of these secondary phenotypes can yield valuable insights about the disease etiology and supplement the main studies. However, due to unequal sampling probabilities between cases and controls, standard regression analysis that assesses the effect of SNPs (single nucleotide polymorphisms) on secondary phenotypes using cases only, controls only, or combined samples of cases and controls can yield inflated type I error rates when the test SNP is associated with the disease. To solve this issue, we propose a Gaussian copula-based approach that efficiently models the dependence between disease status and secondary phenotypes. Through simulations, we show that our method yields correct type I error rates for the analysis of secondary phenotypes under a wide range of situations. To illustrate the effectiveness of our method in the analysis of real data, we applied our method to a genome-wide association study on high-density lipoprotein cholesterol (HDL-C), where “cases” are defined as individuals with extremely high HDL-C level and “controls” are defined as those with low HDL-C level. We treated 4 quantitative traits with varying degrees of correlation with HDL-C as secondary phenotypes and tested for association with SNPs in LIPG, a gene that is well known to be associated with HDL-C. We show that when the correlation between the primary and secondary phenotypes is >0.2, the P values from case–control combined unadjusted analysis are much more significant than those from methods that aim to correct for ascertainment bias. Our results suggest that to avoid false-positive associations, it is important to appropriately model secondary phenotypes in case–control genetic association studies.
PMCID: PMC3372941  PMID: 21933777
Case–control studies; Statistical genetics; Statistical methods in Epidemiology
21.  Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents 
PLoS ONE  2013;8(6):e66019.
Diet influences health as a source of nutrients and toxins, and by shaping the composition of resident microbial populations. Previous studies have begun to map out associations between diet and the bacteria and viruses of the human gut microbiome. Here we investigate associations of diet with fungal and archaeal populations, taking advantage of samples from 98 well-characterized individuals. Diet was quantified using inventories scoring both long-term and recent diet, and archaea and fungi were characterized by deep sequencing of marker genes in DNA purified from stool. For fungi, we found 66 genera, with generally mutually exclusive presence of either the phyla Ascomycota or Basidiomycota. For archaea, Methanobrevibacter was the most prevalent genus, present in 30% of samples. Several other archaeal genera were detected in lower abundance and frequency. Myriad associations were detected for fungi and archaea with diet, with each other, and with bacterial lineages. Methanobrevibacter and Candida were positively associated with diets high in carbohydrates, but negatively with diets high in amino acids, protein, and fatty acids. A previous study emphasized that bacterial population structure was associated primarily with long-term diet, but high Candida abundance was most strongly associated with the recent consumption of carbohydrates. Methanobrevibacter abundance was associated with both long-term and recent consumption of carbohydrates. These results confirm earlier targeted studies and provide a host of new associations to consider in modeling the effects of diet on the gut microbiome and human health.
PMCID: PMC3684604  PMID: 23799070
22.  Association of CETP Taq1B and -629C > A polymorphisms with coronary artery disease and lipid levels in the multi-ethnic Singaporean population 
Hyperlipidaemia is a major risk factor for coronary artery disease (CAD) and cholesteryl ester transfer protein (CETP) gene polymorphisms are known to be associated with lipid profiles.
In this study, we investigated the association of two polymorphisms in the CETP, Taq1B (rs708272) and -629C > A (rs1800775), with CAD and lipid levels (HDL-C) in 662 CAD + cases and 927 controls from the Singapore population comprising Chinese, Malays and Indians.
TaqB2 frequency was significantly lowest in the Malays (0.43) followed by Chinese (0.47) and highest in the Indians (0.56) in the controls. The B2 allele frequency was significantly lower in the Chinese CAD + cases compared to the controls (p = 0.002). The absence of the B2 allele was associated with CAD with an OR 2.0 (95% CI 1.2 to 3.4) after adjustment for the confounding effects of age, smoking, BMI, gender, hypertension, dyslipidemia and diabetes mellitus. The B2 allele was significantly associated with higher plasma HDL-C levels in the Chinese men after adjusting for confounders. Associations with plasma apoA1 levels were significant only in the Chinese men for Taq1B and -629C > A. In addition, the Taq1B polymorphism was only associated with plasma Apo B and Lp(a) in the Malay men. Significant associations were only found in non-smoking subjects with BMI <50th percentile. In this study, the LD coefficients between the Taq1B and -629C > A polymorphisms seemed to be weak.
The absence of the Taq1B2 allele was associated with CAD in the Chinese population only, and the minor allele of the Taq1B polymorphism of the CETP gene was significantly associated with higher plasma HDL-C levels in Chinese men.
PMCID: PMC3699414  PMID: 23758630
Cholesteryl ester transfer protein; -629C > A polymorphism; TaqB1 polymorphism; HDL-cholesterol; Coronary artery disease
23.  Model Selection and Estimation in the Matrix Normal Graphical Model 
Motivated by analysis of gene expression data measured over different tissues or over time, we consider a matrix-valued random variable and the matrix-normal distribution, where the precision matrices have a graphical interpretation for genes and tissues, respectively. We present an l1 penalized likelihood method and an efficient coordinate descent-based computational algorithm for model selection and estimation in such matrix normal graphical models (MNGMs). We provide theoretical results on the asymptotic distributions, the rates of convergence of the estimates and the sparsistency, allowing both the numbers of genes and tissues to diverge as the sample size goes to infinity. Simulation results demonstrate that the MNGMs can lead to better estimates of the precision matrices and better identification of the graph structures than the standard Gaussian graphical models. We illustrate the methods with an analysis of mouse gene expression data measured over ten different tissues.
PMCID: PMC3285238  PMID: 22368309
Gaussian graphical model; Gene networks; High dimensional data; l1 penalized likelihood; Matrix normal distribution; Sparsistency
24.  A Sparse Structured Shrinkage Estimator for Nonparametric Varying-Coefficient Model with an Application in Genomics 
Many problems in genomics are related to variable selection where high-dimensional genomic data are treated as covariates. Such genomic covariates often have certain structures and can be represented as vertices of an undirected graph. Biological processes also vary as functions depending upon some biological state, such as time. High-dimensional variable selection where covariates are graph-structured and the underlying model is nonparametric presents an important but largely unaddressed statistical challenge. Motivated by the problem of regression-based motif discovery, we consider the problem of variable selection for high-dimensional nonparametric varying-coefficient models and introduce a sparse structured shrinkage (SSS) estimator based on basis function expansions and a novel smoothed penalty function. We present an efficient algorithm for computing the SSS estimator. Results on model selection consistency and estimation bounds are derived. Moreover, finite-sample performances are studied via simulations, and the effects of high-dimensionality and structural information of the covariates are especially highlighted. We apply our method to a motif-finding problem using a yeast cell-cycle gene expression dataset and word counts in genes’ promoter sequences. Our results demonstrate that the proposed method can result in better variable selection and prediction for high-dimensional regression when the underlying model is nonparametric and covariates are structured. Supplemental materials for the article are available online.
PMCID: PMC3419598  PMID: 22904608
High-dimensional data; Model selection; Motif analysis; Nonparametric regression; Sparsity; Structured covariates
25.  Optimal Sparse Segment Identification with Application in Copy Number Variation Analysis 
Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of detecting and identifying sparse short segments in a long one-dimensional sequence of data with additive Gaussian white noise, where the number, length and location of the segments are unknown. We present a statistical characterization of the identifiable region of a segment where it is possible to reliably separate the segment from noise. An efficient likelihood ratio selection (LRS) procedure for identifying the segments is developed, and the asymptotic optimality of this method is presented in the sense that the LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable regions. The proposed method is demonstrated with simulations and analysis of a real data set on identification of copy number variants based on high-density SNP data. The results show that the LRS procedure can yield greater gain in power for detecting the true segments than some standard signal identification methods.
PMCID: PMC3610602  PMID: 23543902
Likelihood ratio selection; signal detection; multiple testing; DNA copy number
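The segment-identification idea in the abstract above can be sketched as a simple scan over short intervals. This is an illustrative simplification, not the authors' exact LRS procedure: it assumes unit-variance Gaussian noise, and the function name, the `max_len` bound on segment length, and the default threshold sqrt(2 log n) are assumptions chosen for the example. Each interval is scored by its sum divided by the square root of its length, and non-overlapping intervals are picked greedily from the highest score down.

```python
import math


def lr_segment_scan(x, max_len, threshold=None):
    """Greedy scan for short segments whose mean differs from zero.

    x is a sequence of observations assumed to have unit-variance noise.
    max_len is a hypothetical upper bound on the segment length.
    Returns a sorted list of half-open intervals (start, end).
    """
    n = len(x)
    if n == 0:
        return []
    if threshold is None:
        # Assumed default: the usual sqrt(2 log n) level for n noise terms.
        threshold = math.sqrt(2.0 * math.log(n))
    # Prefix sums give every interval sum in constant time.
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    candidates = []
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            end = start + length
            stat = (prefix[end] - prefix[start]) / math.sqrt(length)
            if abs(stat) > threshold:
                candidates.append((abs(stat), start, end))
    # Take the strongest interval first, then skip anything overlapping it.
    candidates.sort(reverse=True)
    selected = []
    for _, start, end in candidates:
        if all(end <= s or start >= e for s, e in selected):
            selected.append((start, end))
    return sorted(selected)
```

The double loop costs on the order of n times `max_len` evaluations, which is workable when segments are assumed short relative to the sequence.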
