In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smoothes regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under the linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorize–minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
Group selection; Regularization; SNP; Smoothing
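The penalty described above can be sketched in a few lines. The abstract does not give the exact formula, so the form below is an assumption: a group Lasso term with the usual square-root-of-group-size scaling, plus a weighted quadratic penalty on the difference between the size-normalized norms of adjacent groups. The function name `sgl_penalty` and its arguments are hypothetical.

```python
import math

def sgl_penalty(groups, lam1, lam2, weights):
    """Illustrative smoothed-group-Lasso-style penalty (assumed form).

    groups  : list of coefficient lists, one per group, in genomic order
    lam1    : tuning parameter for the group Lasso term
    lam2    : tuning parameter for the smoothing term
    weights : weights[j] couples group j and group j + 1
              (for example, a canonical correlation between the two groups)
    """
    sizes = [len(g) for g in groups]
    norms = [math.sqrt(sum(b * b for b in g)) for g in groups]

    # Group Lasso term: encourages whole groups to be zero.
    group_lasso = lam1 * sum(math.sqrt(d) * nrm for d, nrm in zip(sizes, norms))

    # Quadratic difference term: shrinks adjacent groups toward similar sizes.
    smooth = 0.0
    for j in range(len(groups) - 1):
        diff = norms[j] / math.sqrt(sizes[j]) - norms[j + 1] / math.sqrt(sizes[j + 1])
        smooth += weights[j] * diff ** 2

    return group_lasso + 0.5 * lam2 * smooth
```

With a large weight between two adjacent groups, the second term makes it costly for one group to be large while its neighbor is zero, which is the smoothing behavior the abstract describes.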
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In the existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly-connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty, and conducts gene selection. The second part is a Laplacian penalty, and smoothes the differences of coefficients for tightly-connected genes. A group coordinate descent approach is developed to compute the proposed estimate. Simulation study shows satisfactory performance of the proposed approach when there exist moderate to strong correlations among genes. We analyze three lung cancer prognosis datasets, and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
Integrative analysis; Cancer prognosis; Gene network; Penalized selection; Laplacian shrinkage
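To make the Laplacian part of the penalty concrete, the sketch below computes the quadratic form beta' L beta in two ways for the unnormalized graph Laplacian L = D - A. The abstract does not state which Laplacian is used, so the unnormalized form, the symmetric adjacency matrix with zero diagonal, and the function names are all assumptions made for illustration.

```python
def laplacian_penalty_by_edges(beta, adjacency):
    """Sum over edges of a_ij * (beta_i - beta_j)^2."""
    p = len(beta)
    total = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            total += adjacency[i][j] * (beta[i] - beta[j]) ** 2
    return total

def laplacian_quadratic_form(beta, adjacency):
    """beta' L beta with L = D - A, where D holds the row sums of A.

    Assumes adjacency is symmetric with a zero diagonal.
    """
    p = len(beta)
    degree = [sum(row) for row in adjacency]
    total = 0.0
    for i in range(p):
        for j in range(p):
            l_ij = (degree[i] if i == j else 0.0) - adjacency[i][j]
            total += beta[i] * l_ij * beta[j]
    return total
```

The two functions return the same value, which shows why this penalty smooths coefficients: it is zero when tightly connected genes share the same coefficient and grows with the squared difference, weighted by the connection strength.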
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer, but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model, which allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach, which conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. Simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements, and identify genes associated with the three major subtypes of NHL, namely DLBCL, FL and CLL/SLL. The proposed approach identifies genes different from those identified using alternative approaches and has the best prediction performance.
Integrative analysis; Cancer Prognosis; Gradient descent; NHL; SNP
In China, despite a high coverage rate, health insurance is not used for all illness episodes. Our goal is to identify subjects’ characteristics associated with insurance utilization and the association between utilization and medical expenditure.
A survey was conducted in January and February of 2012. A total of 2,093 middle-aged and elderly subjects (45 years old and above) were surveyed.
Health insurance was not utilized for 12.6% (inpatient), 53.3% (outpatient), and 72.6% (self-treatment) of disease episodes. Subjects’ characteristics were associated with insurance utilization. Inpatient and outpatient treatments were expensive. In the multivariate analysis of outpatient treatment expenditure, insurance utilization was significantly associated with higher treatment cost, lost income, and gross total cost.
Utilization of health insurance may need to be improved. Insurance utilization can reduce out-of-pocket medical expenditure. However, the amount paid by the insured is still high. Policy intervention is needed to further improve the effectiveness of health insurance.
Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions: hierarchical Cauchy and double-exponential distributions, on coefficients in generalized linear models. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Adaptive Lasso; Bayesian inference; Generalized linear model; Genetic polymorphisms; Grouped variables; Hierarchical model; High-dimensional data; Shrinkage prior
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study.
Bi-level selection; group LASSO; concave group selection; penalized regression; sparsity; oracle property
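As a small illustration of how group selection produces group-level sparsity, the sketch below implements the group soft-thresholding operator. This is the well-known closed-form group Lasso update for a single group when that group's design is orthonormal; the orthonormality assumption and the function name are ours, not taken from the review.

```python
import math

def group_soft_threshold(z, lam):
    """Shrink the vector z toward zero as a whole group.

    Returns (1 - lam / ||z||)_+ * z: the entire group is set to zero
    when its Euclidean norm does not exceed lam.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1.0 - lam / norm
    return [scale * v for v in z]
```

Unlike coordinate-wise soft-thresholding, the decision to keep or drop is made on the group norm, so all coefficients in a group enter or leave the model together. Bi-level methods add within-group selection on top of this.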
For censored survival outcomes, it can be of great interest to evaluate the predictive power of individual markers or their functions. Compared with alternative evaluation approaches, the time-dependent ROC (receiver operating characteristics) based approaches rely on much weaker assumptions, can be more robust, and hence are preferred. In this article, we examine evaluation of markers’ predictive power using the time-dependent ROC curve and a concordance measure which can be viewed as a weighted area under the time-dependent AUC (area under the ROC curve) profile. This study significantly advances from existing time-dependent ROC studies by developing nonparametric estimators of the summary indexes and, more importantly, rigorously establishing their asymptotic properties. It reinforces the statistical foundation of the time-dependent ROC based evaluation approaches for censored survival outcomes. Numerical studies, including simulations and application to an HIV clinical trial, demonstrate the satisfactory finite-sample performance of the proposed approaches.
time-dependent ROC; concordance measure; inverse-probability-of-censoring weighting; marker evaluation; survival outcomes
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
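For reference, the unadjusted ranking that the abstract contrasts against can be sketched as follows: compute the area under the ROC curve (AUC) for each biomarker by pairwise comparison and sort. This is only the marker-only baseline; the confounder-adjusted methods proposed in the paper are not reproduced here, and the function names and data layout are hypothetical.

```python
def auc(scores, labels):
    """Empirical AUC: probability that a case scores above a control.

    labels are 1 for cases and 0 for controls; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def rank_markers(markers, labels):
    """Rank markers (dict of name -> list of values) by decreasing AUC."""
    aucs = {name: auc(values, labels) for name, values in markers.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)
```

A ranking of this kind ignores clinical and environmental covariates, so a marker that merely tracks a confounder can rank highly. That is the gap the model-based adjustment is meant to close.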
Recent biomedical studies often measure two distinct sets of risk factors: low-dimensional clinical and environmental measurements, and high-dimensional gene expression measurements. For prognosis studies with right censored response variables, we propose a semiparametric regression model whose covariate effects have two parts: a nonparametric part for low-dimensional covariates, and a parametric part for high-dimensional covariates. A penalized variable selection approach is developed. The selection of parametric covariate effects is achieved using an iterated Lasso approach, for which we prove the selection consistency property. The nonparametric component is estimated using a sieve approach. An empirical model selection tool for the nonparametric component is derived based on the Kullback-Leibler geometry. Numerical studies show that the proposed approach has satisfactory performance. Application to a lymphoma study illustrates the proposed method.
Semiparametric regression; variable selection; right censored data; iterated Lasso
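The two-step idea behind an iterated Lasso can be sketched in the simplest possible setting. The code assumes an orthonormal design, so each Lasso step reduces to coordinate-wise soft-thresholding of the least squares estimates z, and it reuses the same tuning parameter in both steps. Both are simplifying assumptions; censoring and the nonparametric component in the abstract are ignored, and the function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Coordinate-wise Lasso solution under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def iterated_lasso_orthonormal(z, lam):
    """Step 1: plain Lasso. Step 2: Lasso reweighted by 1 / |step-1 estimate|.

    Coefficients set to zero in step 1 stay at zero in step 2.
    """
    first = [soft_threshold(v, lam) for v in z]
    second = []
    for v, b in zip(z, first):
        if b == 0.0:
            second.append(0.0)
        else:
            second.append(soft_threshold(v, lam / abs(b)))
    return first, second
```

In the second step, large first-step estimates receive small penalties and are shrunk less, while small ones are penalized more heavily. This reweighting is what allows selection consistency under weaker conditions than a single Lasso fit.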
Illness conditions lead to medical expenditure. Even with various types of medical insurance, there can still be considerable out-of-pocket costs. Medical expenditure can affect other categories of household consumptions. The goal of this study is to provide an updated empirical description of the distributions of illness conditions and medical expenditure and their associations with other categories of household consumptions.
A phone-call survey was conducted in June and July of 2012. The study was approved by ethics review committees at Xiamen University and FuJen Catholic University. Data was collected using a Computer-Assisted Telephone Survey System (CATSS). “Household” was the unit for data collection and analysis. Univariate and multivariate analyses were conducted, examining the distributions of illness conditions and the associations of illness and medical expenditure with other household consumptions.
The presence of chronic disease and inpatient treatment was not significantly associated with household characteristics. The level of per capita medical expenditure was significantly associated with household size, income, and household head occupation. The presence of chronic disease was significantly associated with levels of education, insurance and durable goods consumption. After adjusting for confounders, the associations with education and durable goods consumption remained significant. The presence of inpatient treatment was not associated with consumption levels. In the univariate analysis, medical expenditure was significantly associated with all other consumption categories. After adjusting for confounding effects, the associations between medical expenditure and the actual amount of entertainment expenses and percentages of basic consumption, savings, and insurance (as of total consumption) remained significant.
This study provided an updated description of the distributions of illness conditions and medical expenditure in Taiwan. The findings were mostly positive in that illness and medical expenditure were not observed to be significantly associated with other consumption categories. This observation differed from those made in some other Asian countries and could be explained by the higher economic status and universal basic health insurance coverage of Taiwan.
Illness; Medical expenditure; Household consumption; Taiwan
Sexual function among testicular cancer survivors is a concern because affected men are of reproductive age when diagnosed. We conducted a case-control study among United States military men to examine whether testicular cancer survivors experienced impaired sexual function.
A total of 246 testicular cancer cases and 236 ethnicity and age matched controls were enrolled in the study in 2008-2009. The Brief Male Sexual Function Inventory (BMSFI) was used to assess sexual function.
Compared to controls, cases scored significantly lower on sex drive (5.77 vs. 5.18), erection (9.40 vs. 8.63), ejaculation (10.83 vs. 9.90), and problem assessment (10.55 vs. 9.54). Cases were significantly more likely to have impaired erection (OR 1.72; 95% CI 1.11-2.64), ejaculation (OR 2.27; 95% CI 1.32-3.91), and problem assessment (OR 2.36; 95% CI 1.43-3.90). In analyses by histology and treatment, the risks of erectile dysfunction, delayed ejaculation, and/or problem assessment were greater among nonseminoma cases and among chemotherapy- and radiation-treated cases than among controls.
This study provides evidence that testicular cancer survivors are more likely to have impaired sexual functioning compared to demographically matched controls. The observed impaired sexual functioning appeared to vary by treatment regimen and histologic subtype.
Testicular cancer; sexual function; military men
Non-Hodgkin Lymphoma (NHL) is a heterogeneous group of malignancies with over thirty different subtypes. Follicular lymphoma (FL) is the most common form of indolent NHL and the second most common form of NHL overall. It has morphologic, immunophenotypic and clinical features significantly different from other subtypes. Considerable effort has been devoted to the identification of risk factors for etiology and prognosis of FL. These risk factors may advance our understanding of the biology of FL and have an impact on clinical practice.
The epidemiology of NHL and FL is briefly reviewed. For FL etiology and prognosis separately, we review clinical, environmental and molecular (including genetic, genomic, epigenetic and others) risk factors suggested in the literature.
A large number of potential risk factors have been suggested in recent studies. However, there is a lack of consensus, and many of the suggested risk factors have not been rigorously validated in independent studies. There is a need for large-scale, prospective studies to consolidate existing findings and discover new risk factors. Some of the identified risk factors are successful at the population level. More effective individual-level risk factors and models remain to be identified.
Follicular lymphoma; Etiology; Non-Hodgkin lymphoma; Prognosis; Risk factor
In high-throughput cancer genomic studies, markers identified from the analysis of single data sets often suffer a lack of reproducibility because of the small sample sizes. An ideal solution is to conduct large-scale prospective studies, which are extremely expensive and time consuming. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple data sets is challenging because of the high dimensionality of genomic measurements and heterogeneity among studies. In this article, we propose a sparse boosting approach for marker identification in integrative analysis of multiple heterogeneous cancer diagnosis studies with gene expression measurements. The proposed approach can effectively accommodate the heterogeneity among multiple studies and identify markers with consistent effects across studies. Simulation shows that the proposed approach has satisfactory identification results and outperforms alternatives including an intensity approach and meta-analysis. The proposed approach is used to identify markers of pancreatic cancer and liver cancer.
Cancer genomics; Marker identification; Sparse boosting
We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using a semiparametric zero-inflated modeling approach, where the observed CAC scores from this cohort consist of a high frequency of zeroes and continuously distributed positive values. Both partially constrained and unconstrained models are considered to investigate the underlying biological processes of CAC development from zero to positive, and from small amount to large amount. Different from existing studies, a model selection procedure based on likelihood cross-validation is adopted to identify the optimal model, which is justified by comparative Monte Carlo studies. A shrinkage version of the cubic regression spline is used for model estimation and variable selection simultaneously. When applying the proposed methods to the MESA data analysis, we show that the two biological mechanisms influencing the initiation of CAC and the magnitude of CAC when it is positive are better characterized by an unconstrained zero-inflated normal model. Our results are significantly different from those in published studies, and may provide further insights into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.
cardiovascular disease; coronary artery calcium; likelihood cross-validation; model selection; penalized spline; proportional constraint; shrinkage
In MESA (Multi-Ethnic Study of Atherosclerosis), it is of interest to model the development and progression of CAC (coronary artery calcium). With about half of the CAC scores equal to zero and the rest continuously distributed, semiparametric two-part models are needed. Our main interest lies in determining the (partial) proportionality between the two covariate effects in two-part models. Such an investigation can provide important information on the mechanisms underlying CAC development. We propose a novel approach, which consists of penalized maximum likelihood estimation and a step-wise hypothesis testing procedure to determine proportionality. Simulation shows satisfactory performance of the proposed approach. Analysis of MESA suggests that proportionality holds for all covariates except LDL and HDL.
Two-part models; Proportionality; Semiparametric estimation
Semiparametric regression models with multiple covariates are commonly encountered. When there are covariates not associated with the response variable, variable selection may lead to sparser models, more lucid interpretations and more accurate estimation. In this study, we adopt a sieve approach for the estimation of nonparametric covariate effects in semiparametric regression models. We adopt a two-step iterated penalization approach for variable selection. In the first step, a mixture of the Lasso and group Lasso penalties is employed to conduct the first-round variable selection and obtain the initial estimate. In the second step, a mixture of the weighted Lasso and weighted group Lasso penalties, with weights constructed using the initial estimate, is employed for variable selection. We show that the proposed iterated approach has the variable selection consistency property, even when the number of unknown parameters diverges with sample size. Numerical studies, including simulation and analysis of a diabetes dataset, show satisfactory performance of the proposed approach.
Iterated penalization; Variable selection; Semiparametric regression
In the evaluation of a healthcare system, it is of interest to identify factors associated with the usage of different healthcare facilities and with different levels of medical expenditure.
A survey was conducted in January and February of 2012 in China. It focused on middle-aged and elderly people aged 45 and above. A total of 2,093 people from 1,152 households were surveyed.
For inpatient treatment, the probability of using grade III hospitals, which had the highest level of care, was positively associated with age, being married, living in urban areas, and having higher income. For outpatient treatment, the probability of using grade III hospitals was positively associated with age, being married, working in enterprises, living in urban areas, living in central and western regions, and having higher income, and negatively associated with being farmers. The total and out-of-pocket (OOP) medical expenses were analyzed separately. It was found that the expense level was associated with age, education, occupation, living in urban areas, type of hospital used, insurance being used, and per capita income.
Access to healthcare and the level of medical expenditure were found to be associated with demographic characteristics. In addition, differences between areas and regions were observed. Such results may be useful for identifying vulnerable populations and for tuning future healthcare development policies.
The semiparametric partially linear model allows flexible modeling of covariate effects on the response variable in regression. It combines the flexibility of nonparametric regression and parsimony of linear regression. The most important assumption in the existing estimation methods for this model is that it is known a priori which covariates have a linear effect and which do not. However, in applied work, this is rarely known in advance. We consider the problem of estimation in the partially linear models without assuming a priori which covariates have linear effects. We propose a semiparametric regression pursuit method for identifying the covariates with a linear effect. Our proposed method is a penalized regression approach using a group minimax concave penalty. Under suitable conditions we show that the proposed approach is model-pursuit consistent, meaning that it can correctly determine which covariates have a linear effect and which do not with high probability. The performance of the proposed method is evaluated using simulation studies, which support our theoretical results. A real data example is used to illustrate the application of the proposed method.
Group selection; Minimax concave penalty; Model-pursuit consistency; Penalized regression; Semiparametric models
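The minimax concave penalty (MCP) mentioned above has a simple closed form, sketched below together with a group version that applies it to each group's Euclidean norm. The scalar formula is the standard MCP definition. Applying it to unscaled group norms, without a group-size factor, is a simplifying assumption, and the function names are hypothetical.

```python
import math

def mcp(t, lam, gamma):
    """Minimax concave penalty with tuning parameter lam and concavity gamma > 1.

    Starts with slope lam at zero (like the Lasso), then flattens out,
    becoming constant once |t| exceeds gamma * lam.
    """
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def group_mcp(groups, lam, gamma):
    """Sum of MCP applied to the Euclidean norm of each coefficient group."""
    total = 0.0
    for g in groups:
        norm = math.sqrt(sum(b * b for b in g))
        total += mcp(norm, lam, gamma)
    return total
```

Because the penalty is flat for large norms, groups with strong signals are not shrunk. This is the property that lets concave penalties avoid the estimation bias of the group Lasso while still setting weak groups exactly to zero.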
The main goal of this study is to examine the distributions of illness conditions and resulting medical expenditures and their associated factors. To achieve this goal, an in-house survey was conducted in August of 2012 in rural Beijing, the capital city of China.
The survey was conducted in Nanjianchang and Beijianchang, which are two villages 20 km away from Miyun, a satellite city of Beijing. Data was collected on 346 households, which included 834 members. Variables measured included household characteristics, household head characteristics, illness conditions, and medical expenditures. Illness conditions and corresponding expenditure were measured for inpatient treatment, outpatient treatment, and self-treatment separately. Multivariate analysis suggested that the presence of inpatient treatment was associated with household head characteristics including age, gender, and education. The presence of a high level of outpatient treatment was associated with household head characteristics including gender and education. The presence of a high level of self-treatment was significantly associated with household size. In the analysis of overall out-of-pocket (OOP) medical expenditure, only age of household head was borderline significant. In the analysis of OOP inpatient expenditure, age and gender of household head were borderline significant. The OOP outpatient expenditure was associated with household size, presence of members older than 60, household head's gender, marital status, and occupation. The OOP self-treatment expenditure was not associated with any household characteristic.
For the surveyed households, medical expenditure made up a considerable proportion of the total consumption. This study suggested that the presence of illness conditions and resulting OOP medical expenditure were associated with certain household and household head characteristics. Such results may help identify the subgroup that is the most affected by illness conditions. As this study collected recent data on inpatient, outpatient, and self-treatment separately, it may provide a useful complement to the existing studies.
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
We characterized the mutational landscape of melanoma, the form of skin cancer with the highest mortality rate, by sequencing the exomes of 147 melanomas. Sun-exposed melanomas had markedly more ultraviolet (UV)-like C>T somatic mutations compared to sun-shielded acral, mucosal and uveal melanomas. Among the newly identified cancer genes was PPP6C, encoding a serine/threonine phosphatase, which harbored mutations that clustered in the active site in 12% of sun-exposed melanomas, exclusively in tumors with mutations in BRAF or NRAS. Notably, we identified a recurrent UV-signature, an activating mutation in RAC1 in 9.2% of sun-exposed melanomas. This activating mutation, the third most frequent in our cohort of sun-exposed melanoma after those of BRAF and NRAS, changes Pro29 to serine (RAC1P29S) in the highly conserved switch I domain. Crystal structures, and biochemical and functional studies of RAC1P29S showed that the alteration releases the conformational restraint conferred by the conserved proline, causes an increased binding of the protein to downstream effectors, and promotes melanocyte proliferation and migration. These findings raise the possibility that pharmacological inhibition of downstream effectors of RAC1 signaling could be of therapeutic benefit.
Cytokines play a critical role in regulating the immune system. In the tumor microenvironment, they influence survival, proliferation, differentiation, and movement of both tumor and stromal cells, and regulate tumor interactions with the extracellular matrix. Given these biologic properties, there is reason to hypothesize that cytokine activity influences the pathogenesis of non-Hodgkin lymphoma (NHL).
We investigated the effect of genetic variation in cytokine genes on NHL prognosis and survival by evaluating genetic variation in individual SNPs as well as the combined effect of multiple deleterious genotypes. Survival information from 496 female incident NHL cases diagnosed during 1996–2000 in Connecticut were abstracted from Connecticut Tumor Registry in 2008. Survival analyses were conducted by comparing Kaplan-Meier curves and hazard ratios (HR) were computed using Cox proportional hazard models adjusting for demographic and tumor characteristics for genes that were suggested by previous studies to be associated with NHL survival.
We found that the variant IL6 genotype is significantly associated (HR=0.42; 95%CI: 0.23–0.77) with a decreased risk of death, as well as relapse and secondary cancer occurrence, among those with NHL. We also found that risk of death, relapse, and secondary cancers varied by specific SNPs for the follicular, DLBCL, and CLL/SLL histologic types. We identified combinations of polymorphisms whose combined deleterious effect significantly alters overall NHL survival and disease-free survival.
Our study provides evidence that the identification of genetic polymorphisms in cytokine genes may help improve the prediction of NHL survival and prognosis.
Non-Hodgkin lymphoma; Cytokines; Single nucleotide polymorphisms; Survival
High-throughput gene profiling studies have been extensively conducted, searching for markers associated with cancer development and progression. In this study, we analyse cancer prognosis studies with right censored survival responses. With gene expression data, we adopt the weighted gene co-expression network analysis (WGCNA) to describe the interplay among genes. In network analysis, nodes represent genes. There are subsets of nodes, called modules, which are tightly connected to each other. Genes within the same modules tend to have co-regulated biological functions. For cancer prognosis data with gene expression measurements, our goal is to identify cancer markers, while properly accounting for the network module structure. A two-step sparse boosting approach, called Network Sparse Boosting (NSBoost), is proposed for marker selection. In the first step, for each module separately, we use a sparse boosting approach for within-module marker selection and construct module-level ‘super markers’. In the second step, we use the super markers to represent the effects of all genes within the same modules and conduct module-level selection using a sparse boosting approach. Simulation study shows that NSBoost can more accurately identify cancer-associated genes and modules than alternatives. In the analysis of breast cancer and lymphoma prognosis studies, NSBoost identifies genes with important biological implications. It outperforms alternatives including the boosting and penalization approaches by identifying a smaller number of genes/modules and/or having better prediction performance.
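The two-step structure described above can be sketched roughly as follows. This is not the authors' NSBoost implementation: it assumes an uncensored continuous response, and plain componentwise L2 boosting with a fixed number of steps stands in for the sparse boosting criterion. The names `boost` and `two_step_boost_sketch`, the step size `nu`, and the toy data are all hypothetical.

```python
def boost(columns, y, steps=50, nu=0.1):
    """Componentwise L2 boosting (a stand-in for sparse boosting).

    At each step, regress the current residual on the single column that
    reduces the residual sum of squares most, and move a fraction nu of
    the way. Columns never selected keep a coefficient of exactly 0.
    """
    coef = [0.0] * len(columns)
    resid = list(y)
    for _ in range(steps):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j, x in enumerate(columns):
            sxx = sum(v * v for v in x)
            if sxx == 0.0:
                continue
            sxr = sum(v * r for v, r in zip(x, resid))
            gain = sxr * sxr / sxx  # drop in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, sxr / sxx, gain
        if best_j is None:
            break  # no column improves the fit
        coef[best_j] += nu * best_b
        resid = [r - nu * best_b * v for r, v in zip(resid, columns[best_j])]
    return coef


def two_step_boost_sketch(gene_cols, y, modules, steps=50, nu=0.1):
    """Step 1: boost within each module and build one super marker per module.
    Step 2: boost on the super markers for module-level selection."""
    names = list(modules)
    within, supers = {}, []
    for m in names:
        cols = [gene_cols[j] for j in modules[m]]
        beta = boost(cols, y, steps, nu)
        within[m] = dict(zip(modules[m], beta))
        # Super marker: the module's fitted linear combination of its genes.
        supers.append([sum(b * c[i] for b, c in zip(beta, cols))
                       for i in range(len(y))])
    gamma = boost(supers, y, steps, nu)
    return within, dict(zip(names, gamma))


# Hypothetical toy data: only gene 0 (in module "A") drives the response.
genes = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1], [0, 0, 0, 0]]
response = [2, -2, 2, -2]
gene_coef, module_coef = two_step_boost_sketch(
    genes, response, {"A": [0, 1], "B": [2, 3]})
```

In this toy example module "B" contributes nothing, so its super marker is identically zero and it is never selected in the second step. Handling right-censored survival responses would need a suitable loss or weighting scheme, which the sketch deliberately leaves out.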
The main goal of this study is to examine how illness conditions and out-of-pocket medical expenditure are associated with other types of household consumption. In November and December of 2011, a survey was conducted in three cities in western China, namely Lan Zhou, Gui Lin and Xi An, and their surrounding rural areas.
Information on demographics, income and consumption was collected on 2,899 households. Data analysis suggested that the presence of household members with chronic diseases was not associated with characteristics of households or household heads. The presence of inpatient treatments was significantly associated with the age of household head (p-value 0.03). The level of per capita medical expense was significantly associated with household size, presence of members younger than 18, presence of members older than 65, basic health insurance coverage, per capita income, and household head occupation. Adjusting for confounding effects, the presence of chronic diseases was negatively associated with the amount of basic consumption (p-value 0.02) and the percentage of basic consumption (p-value 0.01), but positively associated with the percentage of insurance expense (p-value 0.02). Medical expenditure was positively associated with the amounts of all other types of consumption, including basic, education, saving and investment, entertainment, insurance, durable goods, and alcohol/tobacco. It was negatively associated with the percentages of basic consumption, saving and investment, and insurance.
Earlier studies conducted in other Asian countries and rural China found negative associations of illness conditions and medical expenditure with other types of consumption. This study was conducted in three major cities and surrounding areas in western China, which had not been well investigated in the published literature. The observed consumption patterns were different from those in the earlier studies, and the negative associations were not observed. This study may complement the existing rural studies and provide useful information on western Chinese cities.