author:("Li, honghe")
1.  Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis 
Biostatistics (Oxford, England)  2012;14(2):244-258.
Motivated by studying the association between nutrient intake and human gut microbiome composition, we developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting. ssCCA takes into account the phylogenetic relationships among bacteria, which provides important prior knowledge on evolutionary relationships among bacterial taxa. Our ssCCA formulation utilizes a phylogenetic structure-constrained penalty function to impose certain smoothness on the linear coefficients according to the phylogenetic relationships among the taxa. An efficient coordinate descent algorithm is developed for optimization. A human gut microbiome data set is used to illustrate this method. Both simulations and real data applications show that ssCCA performs better than the standard sparse CCA in identifying meaningful variables when there are structures in the data.
doi:10.1093/biostatistics/kxs038
PMCID: PMC3590923  PMID: 23074263
Dimension reduction; Graph; Phylogenetic tree; Regularization; Variable selection
2.  Adjusting for High-dimensional Covariates in Sparse Precision Matrix Estimation by ℓ1-Penalization 
Journal of multivariate analysis  2013;116:10.1016/j.jmva.2013.01.005.
Motivated by the analysis of genetical genomic data, we consider the problem of estimating a high-dimensional sparse precision matrix while adjusting for a possibly large number of covariates, where the covariates can affect the mean value of the random vector. We develop a two-stage estimation procedure to first identify the relevant covariates that affect the means by a joint ℓ1 penalization. The estimated regression coefficients are then used to estimate the mean values in a multivariate sub-Gaussian model in order to estimate the sparse precision matrix through an ℓ1-penalized log-determinant Bregman divergence. Under the multivariate normal assumption, the precision matrix has the interpretation of a conditional Gaussian graphical model. We show that under some regularity conditions, the estimates of the regression coefficients are consistent in element-wise ℓ∞ norm, Frobenius norm and also spectral norm even when p ≫ n and q ≫ n. We also show that with probability converging to one, the estimate of the precision matrix correctly specifies the zero pattern of the true precision matrix. We illustrate our theoretical results via simulations and demonstrate that the method can lead to improved estimates of the precision matrix. We apply the method to an analysis of a yeast genetical genomic data set.
doi:10.1016/j.jmva.2013.01.005
PMCID: PMC3653344  PMID: 23687392
Estimation bounds; Graphical Model; Model selection consistency; Oracle property
3.  VARIABLE SELECTION FOR SPARSE DIRICHLET-MULTINOMIAL REGRESSION WITH AN APPLICATION TO MICROBIOME DATA ANALYSIS* 
The annals of applied statistics  2013;7(1):10.1214/12-AOAS592.
With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output consists of bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group ℓ1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.
doi:10.1214/12-AOAS592
PMCID: PMC3846354  PMID: 24312162
Coordinate descent; Counts data; Overdispersion; Regularized likelihood; Sparse group penalty
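Illustrative note (not part of the abstract): the Dirichlet-multinomial likelihood that entry 3 builds on can be written down in a few lines. The sketch below evaluates the DM log-likelihood of one sample and maps covariates to the DM parameters through a log-linear link. The link and the function names `dm_log_likelihood` and `alpha_from_covariates` are assumptions made for illustration. The sparse group penalty and the block-coordinate descent described in the abstract are not implemented here.

```python
import math


def dm_log_likelihood(counts, alpha):
    """Log-probability of a vector of taxa counts under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive DM parameters, one per taxon.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient numerator plus the normalising gamma terms.
    ll = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        ll += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return ll


def alpha_from_covariates(intercepts, coefs, covariates):
    """Assumed log-linear link: alpha_j = exp(b_j0 + sum_l b_jl * z_l)."""
    return [
        math.exp(b0 + sum(b * z for b, z in zip(row, covariates)))
        for b0, row in zip(intercepts, coefs)
    ]
```

With two taxa and both parameters equal to 1, the DM reduces to a uniform beta-binomial, so every split of N reads has probability 1/(N + 1); this gives a quick sanity check on the formula.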
4.  VARIABLE SELECTION AND ESTIMATION IN HIGH-DIMENSIONAL VARYING-COEFFICIENT MODELS 
Statistica Sinica  2011;21(4):1515-1540.
Nonparametric varying coefficient models are useful for studying the time-dependent effects of variables. Many procedures have been developed for estimation and variable selection in such models. However, existing work has focused on the case when the number of variables is fixed or smaller than the sample size. In this paper, we consider the problem of variable selection and estimation in varying coefficient models in sparse, high-dimensional settings when the number of variables can be larger than the sample size. We apply the group Lasso and basis function expansion to simultaneously select the important variables and estimate the nonzero varying coefficient functions. Under appropriate conditions, we show that the group Lasso selects a model of the right order of dimensionality, selects all variables with the norms of the corresponding coefficient functions greater than a certain threshold level, and is estimation consistent. However, the group Lasso is in general not selection consistent and tends to select variables that are not important in the model. In order to improve the selection results, we apply the adaptive group Lasso. We show that, under suitable conditions, the adaptive group Lasso has the oracle selection property in the sense that it correctly selects important variables with probability converging to one. In contrast, the group Lasso does not possess such an oracle property. Both approaches are evaluated using simulation and demonstrated on a data example.
doi:10.5705/ss.2009.316
PMCID: PMC3902862  PMID: 24478564
Basis expansion; group Lasso; high-dimensional data; non-parametric coefficient function; selection consistency; sparsity
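Illustrative note (not part of the abstract): in a group Lasso with basis expansion, the basis coefficients of one variable form a group that is kept or dropped as a whole. The building block of most group Lasso solvers is the group soft-thresholding (block shrinkage) operator sketched below. The function names are hypothetical, and the adaptive weight shown (the inverse norm of an initial estimate) is a common choice assumed here, not necessarily the exact weighting used in the paper.

```python
import math


def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 for one coefficient group.

    Returns the zero vector when the group norm is at most lam (the whole
    variable is dropped); otherwise shrinks the group towards zero.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]


def adaptive_threshold(lam, initial_group):
    """Assumed adaptive weighting: lam / ||initial estimate of the group||_2.

    Groups whose initial estimate is exactly zero receive an infinite
    threshold, so they stay excluded.
    """
    norm = math.sqrt(sum(x * x for x in initial_group))
    return math.inf if norm == 0.0 else lam / norm
```

A group with a large initial estimate gets a small threshold and is shrunk less, which is the intuition behind the improved selection behaviour of the adaptive variant.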
5.  LEARNING LOCAL DIRECTED ACYCLIC GRAPHS BASED ON MULTIVARIATE TIME SERIES DATA* 
The annals of applied statistics  2013;7(3):1249-1835.
Multivariate time series (MTS) data such as time course gene expression data in genomics are often collected to study the dynamic nature of the systems. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing the conditional independence. We present the asymptotic distribution of the CLRT statistic and show that the tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing a time course gene expression data set related to mouse T-cell activation.
PMCID: PMC3898602  PMID: 24465291
Bayesian network; Composite likelihood ratio test; Genetic network; PCD-PCD algorithm
6.  Network-based Analysis of Multivariate Gene Expression Data 
Multivariate microarray gene expression data are commonly collected to study the genomic responses under ordered conditions such as over increasing/decreasing dose levels or over time during biological processes, where the expression levels of a given gene are expected to be dependent. One important question from such multivariate gene expression experiments is to identify genes that show different expression patterns over treatment dosages or over time; these genes can also point to the pathways that are perturbed during a given biological process. Several empirical Bayes approaches have been developed for identifying the differentially expressed genes in order to account for the parallel structure of the data and to borrow information across all the genes. However, these methods assume that the genes are independent. In this paper, we introduce an alternative empirical Bayes approach for analysis of multivariate gene expression data by assuming a discrete Markov random field (MRF) prior, where the dependency of the differential expression patterns of genes on the networks is modeled by a Markov random field. Simulation studies indicated that the method is quite effective in identifying genes and the modified subnetworks and has higher sensitivity than the commonly used procedures that do not use the pathway information, with similar observed false discovery rates. We applied the proposed methods for analysis of a microarray time course gene expression study of TrkA- and TrkB-transfected neuroblastoma cell lines and identified genes and subnetworks on MAPK, focal adhesion and prion disease pathways that may explain cell differentiation in TrkA-transfected cell lines.
doi:10.1007/978-1-60327-337-4_8
PMCID: PMC3692268  PMID: 23385535
Markov random field; empirical Bayes; KEGG pathways
7.  Simultaneous Discovery of Rare and Common Segment Variants 
Biometrika  2012;100(1):157-172.
Summary
Copy number variants are an important type of genetic structural variation appearing in germline DNA, ranging from common to rare in a population. Both rare and common copy number variants have been reported to be associated with complex diseases, so it is important to simultaneously identify both based on a large set of population samples. We develop a proportion adaptive segment selection procedure that automatically adjusts to the unknown proportions of the carriers of the segment variants. We characterize the detection boundary that separates the region where a segment variant is detectable by some method from the region where it cannot be detected. Although the detection boundaries are very different for the rare and common segment variants, it is shown that the proposed procedure can reliably identify both whenever they are detectable. Compared with methods for single sample analysis, this procedure gains power by pooling information from multiple samples. The method is applied to analyze neuroblastoma samples and identifies a large number of copy number variants that are missed by single-sample methods.
doi:10.1093/biomet/ass059
PMCID: PMC3696347  PMID: 23825436
DNA copy number variant; Information pooling; Population structural variant
8.  PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read distribution 
Nucleic Acids Research  2013;42(3):e20.
Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis, and if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by the use of a non-parametric approach. Our rationale is that regardless of what factors lead to non-uniformity, whether it is due to hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth, and using two real Illumina RNA-Seq data sets including one with quantitative real time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at http://sourceforge.net/projects/pennseq.
doi:10.1093/nar/gkt1304
PMCID: PMC3919567  PMID: 24362841
9.  Robust Gaussian Graphical Modeling via l1 Penalization 
Biometrics  2012;68(4):1197-1206.
Summary
Gaussian graphical models have been widely used as an effective method for studying the conditional independency structure among genes and for constructing genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution. Such outliers in gene expression data can lead to wrong inference on the dependency structure among the genes. We propose an l1 penalized estimation procedure for the sparse Gaussian graphical models that is robustified against possible outliers. The likelihood function is weighted according to how far each observation deviates, where the deviation of the observation is measured based on its own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified-likelihood, where nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that the proposed robust method performs much better than the graphical Lasso for the Gaussian graphical models in terms of both graph structure selection and estimation when outliers are present. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has better biological interpretation than that obtained from the graphical Lasso.
doi:10.1111/j.1541-0420.2012.01785.x
PMCID: PMC3535542  PMID: 23020775
Coordinate descent algorithm; Genetic Network; Iterative proportional fitting; Outliers; Penalized likelihood
10.  Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis 
Summary
Copy number variants (CNVs) are alterations of the DNA of a genome that result in the cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
doi:10.1111/j.1467-9868.2012.01028.x
PMCID: PMC3563068  PMID: 23393425
Robust segment detector; Robust segment identifier; optimality; DNA copy number variant; next generation sequencing data
11.  Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat, promotes atherosclerosis 
Nature medicine  2013;19(5):576-585.
Intestinal microbiota metabolism of choline/phosphatidylcholine produces trimethylamine (TMA), which is further metabolized to a proatherogenic species, trimethylamine-N-oxide (TMAO). Herein we demonstrate that intestinal microbiota metabolism of dietary L-carnitine, a trimethylamine abundant in red meat, also produces TMAO and accelerates atherosclerosis. Omnivorous subjects are shown to produce significantly more TMAO than vegans/vegetarians following ingestion of L-carnitine through a microbiota-dependent mechanism. Specific bacterial taxa in human feces are shown to associate with both plasma TMAO and dietary status. Plasma L-carnitine levels in subjects undergoing cardiac evaluation (n = 2,595) predict increased risks for both prevalent cardiovascular disease (CVD) and incident major adverse cardiac events (MI, stroke or death), but only among subjects with concurrently high TMAO levels. Chronic dietary L-carnitine supplementation in mice significantly altered cecal microbial composition, markedly enhanced synthesis of TMA/TMAO, and increased atherosclerosis, but not following suppression of intestinal microbiota. Dietary supplementation of TMAO, or either carnitine or choline in mice with intact intestinal microbiota, significantly reduced reverse cholesterol transport in vivo. Intestinal microbiota may thus participate in the well-established link between increased red meat consumption and CVD risk.
doi:10.1038/nm.3145
PMCID: PMC3650111  PMID: 23563705
12.  U-statistics in Genetic Association Studies 
Human genetics  2012;131(9):1395-1401.
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from the genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants only explain very small percentages of the heritabilities. To account for the locus- and allelic-heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms (SNPs) and the phenotypes. Compared to single marker analysis, the advantages afforded by the U-statistics-based methods are large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in analysis of the next generation sequencing data and rare variants association studies are discussed.
doi:10.1007/s00439-012-1178-y
PMCID: PMC3419299  PMID: 22610525
13.  MATCHCLIP: locate precise breakpoints for copy number variation using CIGAR string by matching soft clipped reads 
Frontiers in Genetics  2013;4:157.
Copy number variations (CNVs) are associated with many complex diseases. Next generation sequencing data enable one to identify precise CNV breakpoints to better understand the underlying molecular mechanisms and to design more efficient assays. Using the CIGAR strings of the reads, we develop a method that can identify the exact CNV breakpoints, and in cases when the breakpoints are in a repeated region, the method reports a range where the breakpoints can slide. Our method identifies the breakpoints of a CNV using both the positions and CIGAR strings of the reads that cover breakpoints of a CNV. A read with a long soft clipped part (denoted as S in CIGAR) at its 3′ (right) end can be used to identify the 5′ (left) side of the breakpoints, and a read with a long S part at the 5′ end can be used to identify the breakpoint at the 3′ side. To ensure both types of reads cover the same CNV, we require the overlapped common string to include both of the soft clipped parts. When a CNV starts and ends in the same repeated regions, its breakpoints are not unique, in which case our method reports the leftmost positions for the breakpoints and a range within which the breakpoints can be incremented without changing the variant sequence. We have implemented the methods in a C++ package intended for the current Illumina MiSeq and HiSeq platforms for both whole genome and exon-sequencing. Our simulation studies have shown that our method compares favorably with other similar methods in terms of true discovery rate, false positive rate and breakpoint accuracy. Our results from a real application have shown that the detected CNVs are consistent with zygosity and read depth information. The software package is available at http://statgene.med.upenn.edu/softprog.html.
doi:10.3389/fgene.2013.00157
PMCID: PMC3744852  PMID: 23967014
structural variation; breakpoint; duplication; deletion; exon sequencing
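Illustrative note (not part of the abstract, and not the MATCHCLIP C++ code): the first step described in entry 13, reading the soft-clipped (S) parts off a CIGAR string, might look like the sketch below. The function names and the `min_clip` threshold of 10 bases are hypothetical. Matching the overlapped common string of two reads, as the abstract requires, is not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '25S75M' into [(25, 'S'), (75, 'M')]."""
    if cigar == "*":  # SAM convention for an unavailable CIGAR
        return []
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths, ignoring outer hard clips (H)."""
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def breakpoint_sides(cigar, min_clip=10):
    """Which breakpoint side a read may inform, following the abstract:
    a long clip at the right end points to the left-side breakpoint,
    a long clip at the left end points to the right-side breakpoint."""
    left, right = soft_clip_lengths(cigar)
    sides = []
    if right >= min_clip:
        sides.append("left")
    if left >= min_clip:
        sides.append("right")
    return sides
```

For example, a read with CIGAR `70M30S` has a 30-base clip at its right end, so under these assumptions it is a candidate for locating the left-side breakpoint.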
14.  A Gaussian copula approach for the analysis of secondary phenotypes in case–control genetic association studies 
Biostatistics (Oxford, England)  2011;13(3):497-508.
In many case–control genetic association studies, a set of correlated secondary phenotypes that may share common genetic factors with disease status are collected. Examination of these secondary phenotypes can yield valuable insights about the disease etiology and supplement the main studies. However, due to unequal sampling probabilities between cases and controls, standard regression analysis that assesses the effect of SNPs (single nucleotide polymorphisms) on secondary phenotypes using cases only, controls only, or combined samples of cases and controls can yield inflated type I error rates when the test SNP is associated with the disease. To solve this issue, we propose a Gaussian copula-based approach that efficiently models the dependence between disease status and secondary phenotypes. Through simulations, we show that our method yields correct type I error rates for the analysis of secondary phenotypes under a wide range of situations. To illustrate the effectiveness of our method in the analysis of real data, we applied our method to a genome-wide association study on high-density lipoprotein cholesterol (HDL-C), where “cases” are defined as individuals with extremely high HDL-C level and “controls” are defined as those with low HDL-C level. We treated 4 quantitative traits with varying degrees of correlation with HDL-C as secondary phenotypes and tested for association with SNPs in LIPG, a gene that is well known to be associated with HDL-C. We show that when the correlation between the primary and secondary phenotypes is >0.2, the P values from case–control combined unadjusted analysis are much more significant than methods that aim to correct for ascertainment bias. Our results suggest that to avoid false-positive associations, it is important to appropriately model secondary phenotypes in case–control genetic association studies.
doi:10.1093/biostatistics/kxr025
PMCID: PMC3372941  PMID: 21933777
Case–control studies; Statistical genetics; Statistical methods in Epidemiology
15.  Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents 
PLoS ONE  2013;8(6):e66019.
Diet influences health as a source of nutrients and toxins, and by shaping the composition of resident microbial populations. Previous studies have begun to map out associations between diet and the bacteria and viruses of the human gut microbiome. Here we investigate associations of diet with fungal and archaeal populations, taking advantage of samples from 98 well-characterized individuals. Diet was quantified using inventories scoring both long-term and recent diet, and archaea and fungi were characterized by deep sequencing of marker genes in DNA purified from stool. For fungi, we found 66 genera, with generally mutually exclusive presence of either the phyla Ascomycota or Basidiomycota. For archaea, Methanobrevibacter was the most prevalent genus, present in 30% of samples. Several other archaeal genera were detected in lower abundance and frequency. Myriad associations were detected for fungi and archaea with diet, with each other, and with bacterial lineages. Methanobrevibacter and Candida were positively associated with diets high in carbohydrates, but negatively with diets high in amino acids, protein, and fatty acids. A previous study emphasized that bacterial population structure was associated primarily with long-term diet, but high Candida abundance was most strongly associated with the recent consumption of carbohydrates. Methanobrevibacter abundance was associated with both long-term and recent consumption of carbohydrates. These results confirm earlier targeted studies and provide a host of new associations to consider in modeling the effects of diet on the gut microbiome and human health.
doi:10.1371/journal.pone.0066019
PMCID: PMC3684604  PMID: 23799070
16.  Association of CETP Taq1B and -629C > A polymorphisms with coronary artery disease and lipid levels in the multi-ethnic Singaporean population 
Background
Hyperlipidaemia is a major risk factor for coronary artery disease (CAD) and cholesteryl ester transfer protein (CETP) gene polymorphisms are known to be associated with lipid profiles.
Methods
In this study, we investigated the association of two polymorphisms in the CETP, Taq1B (rs708272) and -629C > A (rs1800775), with CAD and lipid levels HDL-C in 662 CAD + cases and 927 controls from the Singapore population comprising Chinese, Malays and Indians.
Results
TaqB2 frequency was significantly lowest in the Malays (0.43) followed by Chinese (0.47) and highest in the Indians (0.56) in the controls. The B2 allele frequency was significantly lower in the Chinese CAD + cases compared to the controls (p = 0.002). The absence of the B2 allele was associated with CAD with an OR 2.0 (95% CI 1.2 to 3.4) after adjustment for the confounding effects of age, smoking, BMI, gender, hypertension, dyslipidemia and diabetes mellitus. The B2 allele was significantly associated with higher plasma HDL-C levels in the Chinese men after adjusting for confounders. Associations with plasma apoA1 levels were significant only in the Chinese men for Taq1B and -629C > A. In addition, the Taq1B polymorphism was only associated with plasma Apo B and Lp(a) in the Malay men. Significant associations were only found in non-smoking subjects with BMI <50th percentile. In this study, the LD coefficients between the Taq1B and -629C > A polymorphisms seemed to be weak.
Conclusion
The absence of the Taq1B2 allele was associated with CAD in the Chinese population only and the minor allele of the Taq1B polymorphism of the CETP gene was significantly associated with higher plasma HDL-C levels in Chinese men.
doi:10.1186/1476-511X-12-85
PMCID: PMC3699414  PMID: 23758630
Cholesteryl ester transfer protein; -629C > A polymorphism; Taq1B polymorphism; HDL-cholesterol; Coronary artery disease
17.  Model Selection and Estimation in the Matrix Normal Graphical Model 
Motivated by analysis of gene expression data measured over different tissues or over time, we consider matrix-valued random variables and the matrix-normal distribution, where the precision matrices have a graphical interpretation for genes and tissues, respectively. We present an l1 penalized likelihood method and an efficient coordinate descent-based computational algorithm for model selection and estimation in such matrix normal graphical models (MNGMs). We provide theoretical results on the asymptotic distributions, the rates of convergence of the estimates and the sparsistency, allowing both the numbers of genes and tissues to diverge as the sample size goes to infinity. Simulation results demonstrate that the MNGMs can lead to better estimates of the precision matrices and better identification of the graph structures than the standard Gaussian graphical models. We illustrate the methods with an analysis of mouse gene expression data measured over ten different tissues.
doi:10.1016/j.jmva.2012.01.005
PMCID: PMC3285238  PMID: 22368309
Gaussian graphical model; Gene networks; High dimensional data; l1 penalized likelihood; Matrix normal distribution; Sparsistency
18.  A Sparse Structured Shrinkage Estimator for Nonparametric Varying-Coefficient Model with an Application in Genomics 
Many problems in genomics are related to variable selection where high-dimensional genomic data are treated as covariates. Such genomic covariates often have certain structures and can be represented as vertices of an undirected graph. Biological processes also vary as functions depending upon some biological state, such as time. High-dimensional variable selection where covariates are graph-structured and the underlying model is nonparametric presents an important but largely unaddressed statistical challenge. Motivated by the problem of regression-based motif discovery, we consider the problem of variable selection for high-dimensional nonparametric varying-coefficient models and introduce a sparse structured shrinkage (SSS) estimator based on basis function expansions and a novel smoothed penalty function. We present an efficient algorithm for computing the SSS estimator. Results on model selection consistency and estimation bounds are derived. Moreover, finite-sample performances are studied via simulations, and the effects of high-dimensionality and structural information of the covariates are especially highlighted. We apply our method to a motif-finding problem using a yeast cell-cycle gene expression dataset and word counts in genes’ promoter sequences. Our results demonstrate that the proposed method can result in better variable selection and prediction for high-dimensional regression when the underlying model is nonparametric and covariates are structured. Supplemental materials for the article are available online.
doi:10.1198/jcgs.2011.10102
PMCID: PMC3419598  PMID: 22904608
High-dimensional data; Model selection; Motif analysis; Nonparametric regression; Sparsity; Structured covariates
19.  Optimal Sparse Segment Identification with Application in Copy Number Variation Analysis 
Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of detecting and identifying sparse short segments in a long one-dimensional sequence of data with additive Gaussian white noise, where the number, length and location of the segments are unknown. We present a statistical characterization of the identifiable region of a segment where it is possible to reliably separate the segment from noise. An efficient likelihood ratio selection (LRS) procedure for identifying the segments is developed, and the asymptotic optimality of this method is presented in the sense that the LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable regions. The proposed method is demonstrated with simulations and analysis of a real data set on identification of copy number variants based on high-density SNP data. The results show that the LRS procedure can yield greater gain in power for detecting the true segments than some standard signal identification methods.
doi:10.1198/jasa.2010.tm10083
PMCID: PMC3610602  PMID: 23543902
Likelihood ratio selection; signal detection; multiple testing; DNA copy number
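The selection idea described in the abstract above can be illustrated with a minimal sketch. It assumes unit-variance Gaussian noise, a user-chosen maximum segment length `max_len`, and a default threshold of sqrt(2 log(n * max_len)); the function name `lrs_segments`, the threshold choice, and the greedy non-overlap rule are illustrative assumptions, not necessarily the authors' exact procedure.

```python
import math

def lrs_segments(x, max_len, threshold=None):
    """Greedy selection of non-overlapping intervals with large
    standardized sums. Intervals are half-open: (start, end).
    Assumes the noise has unit variance (an assumption of this sketch)."""
    n = len(x)
    if threshold is None:
        # Illustrative default threshold; not taken from the abstract.
        threshold = math.sqrt(2.0 * math.log(n * max_len))

    # Prefix sums give each interval sum in constant time.
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)

    # Score every interval of length at most max_len.
    candidates = []
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            stat = (prefix[end] - prefix[start]) / math.sqrt(end - start)
            if abs(stat) > threshold:
                candidates.append((abs(stat), start, end))

    # Keep the strongest interval, drop everything overlapping it, repeat.
    candidates.sort(reverse=True)
    selected = []
    for stat, start, end in candidates:
        if all(end <= s or start >= e for _, s, e in selected):
            selected.append((stat, start, end))
    return sorted((s, e) for _, s, e in selected)
```

For a noiseless toy sequence with a single elevated block, `lrs_segments([0.0] * 20 + [5.0] * 5 + [0.0] * 20, max_len=8)` returns `[(20, 25)]`.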
20.  Sociability and brain development in BALB/cJ and C57BL/6J mice 
Behavioural brain research  2011;228(2):299-310.
Sociability, the tendency to seek social interaction, propels the development of social cognition and social skills, but is disrupted in autism spectrum disorders (ASD). BALB/cJ and C57BL/6J inbred mouse strains are useful models of low and high levels of juvenile sociability, respectively, but the neurobiological and developmental factors that account for the strains’ contrasting sociability levels are largely unknown. We hypothesized that BALB/cJ mice would show increasing sociability with age but that C57BL/6J mice would show high sociability throughout development. We also hypothesized that littermates would resemble one another in sociability more than non-littermates. Finally, we hypothesized that low sociability would be associated with low corpus callosum size and increased brain size in BALB/cJ mice. Separate cohorts of C57BL/6J and BALB/cJ mice were tested for sociability at 19, 23, 31, 42, or 70 days of age, and brain weights and mid-sagittal corpus callosum area were measured. BALB/cJ sociability increased with age, and a strain by age interaction in sociability between 31 and 42 days of age suggested strong effects of puberty on sociability development. Sociability scores clustered according to litter membership in both strains, and perinatal litter size and sex ratio were identified as factors that contributed to this clustering in C57BL/6J, but not BALB/cJ, litters. There was no association between corpus callosum size and sociability, but smaller brains were associated with lower sociability in BALB/cJ mice. The associations reported here will provide directions for future mechanistic studies of sociability development.
doi:10.1016/j.bbr.2011.12.001
PMCID: PMC3474345  PMID: 22178318
Autism; Mouse; Model; Juvenile; Social; Behavior
21.  High-Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis 
Biometrics  2011;68(1):316-326.
Summary
We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has, so far, been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in an expression quantitative trait loci (eQTL) study of 112 yeast segregants and apply our method to it. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and leads to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
doi:10.1111/j.1541-0420.2011.01652.x
PMCID: PMC3218221  PMID: 22547833
Generalized least squares; Heteroscedasticity; Large p small n; Model selection; Sparse regression; Variance estimation
22.  Optimal False Discovery Rate Control for Dependent Data 
Statistics and its interface  2011;4(4):417-430.
This paper considers the problem of optimal false discovery rate control when the test statistics are dependent. An optimal joint oracle procedure, which minimizes the false non-discovery rate subject to a constraint on the false discovery rate, is developed. A data-driven marginal plug-in procedure is then proposed to approximate the optimal joint procedure for multivariate normal data. It is shown that the marginal procedure is asymptotically optimal for multivariate normal data with a short-range dependent covariance structure. Numerical results show that the marginal procedure controls the false discovery rate and leads to a smaller false non-discovery rate than several commonly used p-value-based false discovery rate controlling methods. The procedure is illustrated by an application to a genome-wide association study of neuroblastoma, where it identifies a few more genetic variants potentially associated with neuroblastoma than several p-value-based false discovery rate controlling procedures.
PMCID: PMC3559028  PMID: 23378870
Large scale multiple testing; Marginal rule; Optimal oracle rule; Weighted classification
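For context on the p-value-based comparators mentioned in the abstract above, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. This is assumed to be one of the "commonly used" baselines; it is not the oracle or marginal plug-in procedure proposed in the paper, and the function name `benjamini_hochberg` is illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected
```

The rule is step-up: every hypothesis ranked at or below the largest qualifying rank is rejected, even if its own p-value misses its individual cutoff.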
23.  Optimal False Discovery Rate Control for Dependent Data 
Proteomics  2011;12(1):21-31.
doi:10.1002/pmic.201100464
PMCID: PMC3415307  PMID: 22065615
24.  A SPARSE CONDITIONAL GAUSSIAN GRAPHICAL MODEL FOR ANALYSIS OF GENETICAL GENOMICS DATA* 
The annals of applied statistics  2011;5(4):2630-2650.
Genetical genomics experiments are now routinely conducted to measure both genetic markers and gene expression data on the same subjects. The gene expression levels are often treated as quantitative traits and are subject to standard genetic analysis in order to identify the expression quantitative trait loci (eQTL). However, the genetic architecture for many gene expressions may be complex, and poorly estimated genetic architecture may compromise the inferences of the dependency structures of the genes at the transcriptional level. In this paper, we introduce a sparse conditional Gaussian graphical model for studying the conditional independence relationships among a set of gene expressions adjusting for possible genetic effects, where the gene expressions are modeled with seemingly unrelated regressions. We present an efficient coordinate descent algorithm to obtain the penalized estimation of both the regression coefficients and the sparse concentration matrix. The corresponding graph can be used to determine the conditional independence among a group of genes while adjusting for shared genetic effects. Simulation experiments, asymptotic convergence rates and sparsistency are used to justify our proposed methods. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one. We apply our methods to the analysis of a yeast eQTL data set and demonstrate that the conditional Gaussian graphical model leads to a more interpretable gene network than a standard Gaussian graphical model based on gene expression data alone.
doi:10.1214/11-AOAS494
PMCID: PMC3419502  PMID: 22905077
eQTL; Gaussian graphical model; Regularization; Genetic networks; Seemingly unrelated regression
25.  High Dimensional ODEs Coupled with Mixed-Effects Modeling Techniques for Dynamic Gene Regulatory Network Identification 
Gene regulation is a complicated process. The interaction of many genes and their products forms an intricate biological network. Identification of this dynamic network will help us understand the biological process in a systematic way. However, the construction of such a dynamic network is very challenging for a high-dimensional system. In this article we propose to use a set of ordinary differential equations (ODE), coupled with dimension reduction by clustering and mixed-effects modeling techniques, to model the dynamic gene regulatory network (GRN). The ODE models allow us to quantify both positive and negative gene regulations as well as feedback effects of one set of genes in a functional module on the dynamic expression changes of the genes in another functional module, which results in a directed graph network. A five-step procedure, Clustering, Smoothing, regulation Identification, parameter Estimates refining and Function enrichment analysis (CSIEF), is developed to identify the ODE-based dynamic GRN. The proposed CSIEF procedure employs a series of statistical methods and techniques, including nonparametric mixed-effects models with a mixture distribution for clustering, nonparametric mixed-effects smoothing-based methods for ODE models, smoothly clipped absolute deviation (SCAD)-based variable selection, and the stochastic approximation EM (SAEM) approach for mixed-effects ODE model parameter estimation. The key step of the proposed procedure, the SCAD-based variable selection, is justified by investigating its asymptotic properties and validated by Monte Carlo simulations. We apply the proposed method to identify the dynamic GRN for yeast cell cycle progression data. We are able to annotate the identified modules through function enrichment analyses. Some interesting biological findings are discussed. The proposed procedure is a promising tool for constructing a general dynamic GRN and more complicated dynamic networks.
doi:10.1198/jasa.2011.ap10194
PMCID: PMC3509540  PMID: 23204614
Differential equations; Network graph; NLME; Nonparametric mixed-effects model; Saccharomyces cerevisiae; SAEM; SCAD; Time course microarray data; Two-stage smoothing-based method; Yeast cell cycles
