1.  Integrative Analysis of High-throughput Cancer Studies with Contrasted Penalization 
Genetic epidemiology  2014;38(2):144-151.
In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances from the published ones by introducing the contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance.
PMCID: PMC4355402  PMID: 24395534
Integrative analysis; Contrasted penalization; Marker selection; High-throughput cancer studies
2.  Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization 
In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods.
PMCID: PMC3933169  PMID: 24578589
cancer diagnosis studies; composite penalization; gene expression; integrative analysis
3.  Illness and medical and other expenditures: observations from western and eastern China 
Illness and the medical expenditure that follows have a profound impact on the well-being of individuals and households. China is a huge country with significant regional differences. The goal of this study is to investigate the associations of illness and medical expenditure with other categories of household expenditures, with special attention paid to the differences in observations between the western and eastern regions.
A survey was conducted in six major cities in China, three in the east and three in the west, in 2011. Data on demographics, illness conditions, and medical and other expenditures were collected from 12,515 households.
In the analysis of the associations of illness conditions and medical expenditure with demographics, multiple significant associations were observed, and there were differences between the eastern and western regions. In univariate analyses, illness conditions and medical expenditure were found to have significant associations with other categories of expenditures. In multivariate analyses adjusting for household and household head characteristics, few associations were observed, and there were differences between the regions.
This study has provided empirical evidence on the associations of illness/medical expenditure with demographics and with other categories of expenditures. Differences across regions were observed in multiple aspects. The reasons underlying such differences are worth investigating further.
PMCID: PMC4336723  PMID: 25879667
Illness condition; Medical expenditure; Household expenditure; Cross-region difference; China
4.  Array Platform Modeling and Analysis (A) 
Cancer Informatics  2015;13(Suppl 4):91-93.
PMCID: PMC4334035  PMID: 25733794
5.  Censored Rank Independence Screening for High-dimensional Survival Data 
Biometrika  2014;101(4):799-814.
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
PMCID: PMC4318124  PMID: 25663709
High-dimensional survival data; Rank independence screening; Sure screening property
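The abstract above contrasts the proposed rank-based screening with Pearson correlation screening (Fan and Lv, 2008). As a rough illustration of the baseline only, the sketch below ranks covariates by absolute marginal Pearson correlation with an uncensored outcome and keeps the top few. It is not the censored rank independence screening procedure; the function names `pearson` and `correlation_screen` and the `keep` parameter are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def correlation_screen(columns, y, keep):
    """Return the sorted indices of the `keep` covariates with the largest
    absolute marginal correlation with y (assumed: no censoring)."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])
```

Because the score is a Pearson correlation, a single outlying covariate value can change the ranking, which is the weakness the abstract attributes to this baseline.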
6.  Racial Differences in Nasopharyngeal Carcinoma in the United States 
Cancer epidemiology  2013;37(6):10.1016/j.canep.2013.08.008.
Nasopharyngeal carcinoma (NPC) is a malignant neoplasm arising from the mucosal epithelium of the nasopharynx. Different races can have different etiology, presentation, and progression patterns.
Data were analyzed on NPC patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1973 and 2009. Racial groups studied included non-Hispanic whites, Hispanic whites, blacks, Asians, and others. Patient characteristics, age-adjusted incidence and mortality rates, treatment, and five-year relative survival rates were compared across races. Stratification by stage at diagnosis and histologic type was considered. Multivariate regression was conducted to evaluate the significance of racial differences.
Patient characteristics that were significantly different across races included age at diagnosis, histologic type, in situ/malignant tumors in lifetime, stage, grade, and regional nodes positive. Incidence and mortality rates were significantly different across races, with Asians having the highest rates overall and stratified by age and/or histologic type. Asians also had the highest rate of receiving radiation only. The racial differences in treatment were significant in the multivariate stratified analysis. When stratified by stage and histologic type, Asians had the best five-year survival rates. The survival experience of other races depended on stage and type. In the multivariate analysis, the racial differences were significant.
Analysis of the SEER data shows that racial differences exist among NPC patients in the U.S. This result can be informative to cancer epidemiologists and clinicians.
PMCID: PMC3851929  PMID: 24035238
nasopharyngeal carcinoma; racial differences; SEER
7.  Racial differences in mantle cell lymphoma in the United States 
BMC Cancer  2014;14:764.
MCL (mantle cell lymphoma) is a rare subtype of NHL (non-Hodgkin lymphoma) with mostly poor prognosis. Different races have different etiology, presentation, and progression patterns.
Data were analyzed on MCL patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1992 and 2009. SEER contains the most comprehensive population-based cancer information in the U.S., covering approximately 28% of the population. Racial groups analyzed included non-Hispanic whites, Hispanic whites, blacks, and Asians/PIs (Pacific Islanders). Patient characteristics, age-adjusted incidence rate, and survival rate were compared across races. Stratification by age, gender, and stage at diagnosis was considered. Multivariate analysis was conducted on survival.
In the analysis of patients’ characteristics, distributions of gender, marital status, age at diagnosis, stage, and extranodal involvement were significantly different across races. For all three age groups and both male and female, non-Hispanic whites have the highest incidence rates. In the analysis of survival, for cancers diagnosed in the period of 1992–2004, no significant racial difference is observed. For cancers diagnosed in the period of 1999–2004, significant racial differences exist for the 40–64 age group and stage III and IV cancers.
Racial differences exist among MCL patients in the U.S. in terms of patients’ characteristics, incidence, and survival. More extended data collection and analysis are needed to more comprehensively describe and understand the racial differences.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2407-14-764) contains supplementary material, which is available to authorized users.
PMCID: PMC4210548  PMID: 25315847
Mantle cell lymphoma; Racial differences; SEER; Non-hodgkin lymphoma
8.  Identification of gene–environment interactions in cancer studies using penalization 
Genomics  2013;102(4):10.1016/j.ygeno.2013.08.006.
High-throughput cancer studies have been extensively conducted, searching for genetic markers associated with outcomes beyond clinical and environmental risk factors. Gene–environment interactions can have important implications beyond main effects. The commonly-adopted single-marker analysis cannot accommodate the joint effects of a large number of markers. The existing joint-effects methods also have limitations. Specifically, they may suffer from high computational cost, do not respect the “main effect, interaction” hierarchical structure, or use ineffective techniques. We develop a penalization method for the identification of important G × E interactions and main effects. It has an intuitive formulation, respects the hierarchical structure, accommodates the joint effects of multiple markers, and is computationally affordable. In numerical study, we analyze prognosis data under the AFT (accelerated failure time) model. Simulation shows satisfactory performance of the proposed method. Analysis of an NHL (non-Hodgkin lymphoma) study with SNP measurements shows that the proposed method identifies markers with important implications and satisfactory prediction performance.
PMCID: PMC3869641  PMID: 23994599
Gene–environment interaction; Penalized marker identification; Cancer prognosis
9.  Accounting for linkage disequilibrium in genome-wide association studies: A penalized regression method 
Statistics and its interface  2013;6(1):99-115.
Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as LASSO do not take into account the possible linkage disequilibrium between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates linkage disequilibrium (LD) information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method. This algorithm is efficient in dealing with a large number of SNPs. A multi-split method is used to calculate the p-values of the selected SNPs for assessing their significance. We refer to the proposed penalty function as the smoothed MCP and the proposed approach as the SMCP method. Performance of the proposed SMCP method and its comparison with LASSO and MCP approaches are evaluated through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using heterogeneous stock mice data and a rheumatoid arthritis dataset.
PMCID: PMC4172344  PMID: 25258655
Genetic association; Feature selection; Linkage disequilibrium; Penalized regression; Single nucleotide polymorphism
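The penalty described in the abstract above combines MCP with a term that penalizes differences between effects at adjacent, correlated SNPs. A minimal sketch of how such a penalty could be evaluated follows. The MCP formula is the standard one; the smoothing term (squared difference of absolute coefficients, scaled by `lam2 / 2` and by per-pair `weights` standing in for an LD measure) is an assumption made for illustration and may differ from the form used in the paper. The names `mcp` and `smoothed_mcp_penalty` are hypothetical.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty at t: lam*|t| - t^2/(2*gamma) while
    |t| <= gamma*lam, and the constant gamma*lam^2/2 beyond that."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def smoothed_mcp_penalty(beta, weights, lam1, lam2, gamma):
    """MCP on every coefficient plus a weighted quadratic penalty on
    differences between adjacent coefficients.

    weights[j] is an assumed LD-based weight for the pair (j, j + 1),
    so len(weights) must equal len(beta) - 1.
    """
    if len(weights) != len(beta) - 1:
        raise ValueError("need one weight per adjacent pair")
    sparsity = sum(mcp(b, lam1, gamma) for b in beta)
    smooth = 0.5 * lam2 * sum(
        w * (abs(beta[j]) - abs(beta[j + 1])) ** 2
        for j, w in enumerate(weights)
    )
    return sparsity + smooth
```

Setting `lam2 = 0` reduces the sketch to plain MCP, which makes it easy to see how much of the total penalty comes from the smoothing term.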
10.  Integrative analysis of multiple cancer genomic datasets under the heterogeneity model 
Statistics in medicine  2013;32(20):3509-3521.
In the analysis of cancer studies with high-dimensional genomic measurements, integrative analysis provides an effective way of pooling information across multiple heterogeneous datasets. The genomic basis of multiple independent datasets, which can be characterized by the sets of genomic markers, can be described using the homogeneity model or heterogeneity model. Under the homogeneity model, all datasets share the same set of markers associated with responses. In contrast, under the heterogeneity model, different studies have overlapping but possibly different sets of markers. The heterogeneity model contains the homogeneity model as a special case and can be much more flexible. Marker selection under the heterogeneity model calls for bi-level selection to determine whether a covariate is associated with response in any study at all as well as in which studies it is associated with responses. In this study, we consider two minimax concave penalty (MCP) based penalization approaches for marker selection under the heterogeneity model. For each approach, we describe its rationale and an effective computational algorithm. We conduct simulation to investigate their performance and compare with the existing alternatives. We also apply the proposed approaches to the analysis of gene expression data on multiple cancers.
PMCID: PMC3743947  PMID: 23519988
Integrative analysis; Heterogeneity model; Marker selection
11.  Polymorphisms in JAK/STAT Signaling Pathway Genes and Risk of Non-Hodgkin Lymphoma 
Leukemia research  2013;37(9):1120-1124.
Impaired function of Janus kinase/signal transducer and activator of transcription (JAK/STAT) signaling pathway genes leads to immunodeficiency and various hematopoietic disorders. We evaluated the association between genetic polymorphisms (SNPs) in 12 JAK/STAT pathway genes (JAK3, STAT1, STAT2, STAT3, STAT4, STAT5a, STAT5b, STAT6, SOCS1, SOCS2, SOCS3, and SOCS4) and NHL risk in a population-based case-control study of Connecticut women. We identified three SNPs in STAT3 (rs12949918 and rs6503695) and STAT4 (rs932169) associated with NHL risk after adjustment for multiple comparisons. Our results suggest that genetic variation in JAK/STAT pathway genes may play a role in lymphomagenesis and warrants further investigation.
PMCID: PMC3998836  PMID: 23768868
JAK/STAT signaling pathway; Non-Hodgkin Lymphoma; polymorphism; case-control study
12.  Birth Weight Reference Percentiles for Chinese 
PLoS ONE  2014;9(8):e104779.
To develop a reference of population-based gestational age-specific birth weight percentiles for contemporary Chinese.
Birth weight data were collected by the China National Population-based Birth Defects Surveillance System. A total of 1,105,214 live singleton births aged ≥28 weeks of gestation without birth defects during 2006–2010 were included. The lambda-mu-sigma method was utilized to generate percentiles and curves.
Gestational age-specific birth weight percentiles for male and female infants were constructed separately. Significant differences were observed between the current reference and other references developed for Chinese or non-Chinese infants.
There have been moderate increases in birth weight percentiles for Chinese infants of both sexes and most gestational ages since the 1980s, suggesting the importance of utilizing an updated national reference for both clinical and research purposes.
PMCID: PMC4134219  PMID: 25127131
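The lambda-mu-sigma (LMS) method mentioned above summarizes the birth weight distribution at each gestational age by a Box-Cox power L, a median M, and a coefficient of variation S. As a minimal sketch, the standard LMS formulas convert between a percentile and a measurement as shown below. The function names are hypothetical, and any L, M, S values passed in are placeholders, not values from this study.

```python
import math
from statistics import NormalDist


def lms_centile(L, M, S, p):
    """Measurement at percentile p (0 < p < 1) given LMS parameters."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile for p
    if L == 0:
        return M * math.exp(S * z)
    return M * (1.0 + L * S * z) ** (1.0 / L)


def lms_z_score(L, M, S, y):
    """Standard normal z-score of measurement y given LMS parameters."""
    if L == 0:
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)
```

The 50th percentile always equals M, since the quantile z is zero there; this gives a quick sanity check for any fitted parameter table.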
13.  Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets 
Genetics research  2013;95(0):68-77.
In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach.
PMCID: PMC4090387  PMID: 23938111
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
14.  Ethnicity and Sex Modify the Association of Serum C-Reactive Protein with Microalbuminuria 
Ethnicity & disease  2008;18(3):324-329.
To study the association between serum C-reactive protein (CRP) and urinary albumin excretion in the Multi-Ethnic Study of Atherosclerosis and to assess whether the association is modified by ethnicity, sex, or systolic blood pressure.
This was a cross-sectional study of 6675 participants who were free from macroalbuminuria and clinical cardiovascular disease (mean age 62.1 years, 53% female; 39% White, 27% African American, 22% Hispanic, and 12% Chinese). Urinary albumin excretion was measured by spot urine albumin-to-creatinine ratio (ACR). Effect modifications were tested after adjusting for age, diabetes, body mass index, smoking, use of angiotensin-converting enzyme inhibitor or angiotensin-receptor blocker, other antihypertensive drugs, estrogens, statins, and high-density lipoprotein cholesterol and triglyceride levels.
The association between CRP and ACR was modified by ethnicity (P=.01) and sex (P<.001), but not by systolic blood pressure. After multivariate adjustment, the association remained in Chinese, African American, and Hispanic men and African American women (P<.02 for African American men, and P<.04 for the other subgroups).
The association between CRP and ACR was modified by ethnicity and sex; it was stronger in non-White men and African American women. These interactions have not been reported before, and future studies should consider them.
PMCID: PMC4089959  PMID: 18785447
Albuminuria; C-Reactive Protein; Ethnicity; Gender
15.  Polymorphisms in DNA Repair Pathway Genes, Body Mass Index, and Risk of Non-Hodgkin Lymphoma 
American journal of hematology  2013;88(7):606-611.
We conducted a population-based case-control study in Connecticut women to test the hypothesis that genetic variations in DNA repair pathway genes may modify the relationship between body mass index (BMI) and risk of non-Hodgkin lymphoma (NHL). Compared to those with BMI < 25, women with BMI ≥ 25 had significantly increased risk of NHL among women who carried BRCA1 (rs799917) CT/TT, ERCC2 (rs13181) AA, XRCC1 (rs1799782) CC, and WRN (rs1801195) GG genotypes, but no increase in NHL risk among women who carried BRCA1 CC, ERCC2 AC/CC, XRCC1 CT/TT, and WRN GT/TT genotypes. A significant interaction with BMI was only observed for WRN (rs1801195, P=0.004) for T-cell lymphoma and ERCC2 (rs13181, P=0.002) for diffuse large B-cell lymphoma. The results suggest that common genetic variation in DNA repair pathway genes may modify the association between BMI and NHL risk.
PMCID: PMC3902049  PMID: 23619945
Non-Hodgkin lymphoma; BMI; polymorphisms; DNA repair genes
16.  Penalized multivariate linear mixed model for longitudinal genome-wide association studies 
BMC Proceedings  2014;8(Suppl 1):S73.
We consider analysis of Genetic Analysis Workshop 18 data, which involves multiple longitudinal traits and dense genome-wide single-nucleotide polymorphism (SNP) markers. We use a multivariate linear mixed model to account for the covariance of random effects and multivariate residuals. We divide the SNPs into groups according to the genes they belong to and score them using weighted sum statistics. We propose a penalized approach for genetic variant selection at the gene level. The overall modeling and penalized selection method is referred to as the penalized multivariate linear mixed model. Cross-validation is used for tuning parameter selection. A resampling approach is adopted to evaluate the relative stability of the identified genes. Application to the Genetic Analysis Workshop 18 data shows that the proposed approach can effectively select markers associated with phenotypes at gene level.
PMCID: PMC4143695  PMID: 25519343
17.  The Risk of Second Cancers After Diagnosis of Primary Thyroid Cancer Is Elevated in Thyroid Microcarcinomas 
Thyroid  2013;23(5):575-582.
Thyroid cancers have increased dramatically over the past few decades. Comorbidities may be important, and previous studies have indicated elevated second cancer risk after initial primary thyroid cancers. This study examined the risk of second cancers after development of a primary thyroid cancer, utilizing the Surveillance, Epidemiology, and End Results (SEER) program database.
The cohort consisted of men and women diagnosed with first primary thyroid cancer who were reported to a SEER database in 1973–2008 (n=52,103). Standardized incidence ratios (SIR) were calculated for all secondary cancers. Confidence intervals and p-values are two-sided, use a significance level of 0.05, and are based on Poisson exact methods.
In this cohort, 4457 individuals developed second cancers. The risk of developing second cancers after a primary thyroid cancer varied from 10% to 150% depending on different cancer types. Cancers in all sites, breast, skin, prostate, kidney, brain, salivary gland, second thyroid, lymphoma, myeloma, and leukemia were elevated. The magnitude of the risk varied by histology, tumor size, calendar year of first primary thyroid cancer diagnosis, and the treatment of the primary thyroid cancer. The risk of a second cancer was elevated in patients whose first primary thyroid carcinomas were small, or were diagnosed after 1994, or in whom some form of radiation treatment was administered.
This large population-based analysis of second cancers among thyroid cancer patients suggests that there was an increase of second cancers in all sites, and the most commonly elevated second cancers were the salivary gland and kidney. Additionally, the increase in second cancers in patients with recently diagnosed thyroid microcarcinomas (<10 mm) suggests that aggressive radiation treatment of the first primary thyroid cancer, the environment, and genetic susceptibility, may increase the risk of a second cancer.
PMCID: PMC3643257  PMID: 23237308
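The methods above report standardized incidence ratios with two-sided confidence intervals based on Poisson exact methods. A minimal sketch of that calculation is given below: the SIR is the observed count divided by the expected count, and the exact limits for the Poisson mean are found by bisection on the Poisson tail probabilities. The function names, the bisection bracket, and the iteration count are assumptions for illustration; this is not the SEER software's implementation.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summing terms in log space."""
    if k < 0:
        return 0.0
    if mu <= 0:
        return 1.0
    log_mu = math.log(mu)
    total = sum(math.exp(-mu + i * log_mu - math.lgamma(i + 1))
                for i in range(k + 1))
    return min(total, 1.0)


def _bisect(f, target, lo, hi, increasing, iters=200):
    """Find x in [lo, hi] with f(x) = target for a monotone f."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sir_exact_ci(observed, expected, alpha=0.05):
    """Return (SIR, lower, upper) with an exact Poisson interval."""
    hi = observed + 10.0 * math.sqrt(observed + 1.0) + 20.0  # assumed bracket
    if observed == 0:
        mu_lo = 0.0
    else:
        # Lower limit: P(X >= observed) = alpha / 2, increasing in mu.
        mu_lo = _bisect(lambda m: 1.0 - poisson_cdf(observed - 1, m),
                        alpha / 2.0, 0.0, hi, increasing=True)
    # Upper limit: P(X <= observed) = alpha / 2, decreasing in mu.
    mu_hi = _bisect(lambda m: poisson_cdf(observed, m),
                    alpha / 2.0, 0.0, hi, increasing=False)
    return observed / expected, mu_lo / expected, mu_hi / expected
```

For example, `sir_exact_ci(10, 5.0)` gives an SIR of 2.0; an interval whose lower limit exceeds 1 would indicate a significantly elevated risk at the chosen level.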
18.  Incorporating group correlations in genome-wide association studies using smoothed group Lasso 
Biostatistics (Oxford, England)  2012;14(2):205-219.
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smoothes regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a group coordinate descent (GCD) algorithm for computing the solution path with linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorize–minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
PMCID: PMC3590928  PMID: 22988281
Group selection; Regularization; SNP; Smoothing
19.  Incorporating Network Structure in Integrative Analysis of Cancer Prognosis Data 
Genetic epidemiology  2012;37(2):173-183.
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In the existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly-connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty, and conducts gene selection. The second part is a Laplacian penalty, and smoothes the differences of coefficients for tightly-connected genes. A group coordinate descent approach is developed to compute the proposed estimate. Simulation study shows satisfactory performance of the proposed approach when there exist moderate to strong correlations among genes. We analyze three lung cancer prognosis datasets, and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
PMCID: PMC3909475  PMID: 23161517
Integrative analysis; Cancer prognosis; Gene network; Penalized selection; Laplacian shrinkage
20.  Integrative Analysis of Cancer Prognosis Data with Multiple Subtypes Using Regularized Gradient Descent 
Genetic epidemiology  2012;36(8):829-838.
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer, but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model, which allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach, which conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. Simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements, and identify genes associated with the three major subtypes of NHL, namely DLBCL, FL and CLL/SLL. The proposed approach identifies genes different from those identified using alternative approaches and has the best prediction performance.
PMCID: PMC3729731  PMID: 22851516
Integrative analysis; Cancer Prognosis; Gradient descent; NHL; SNP
21.  Health Insurance Utilization and Its Impact: Observations from the Middle-Aged and Elderly in China 
PLoS ONE  2013;8(12):e80978.
In China, despite a high coverage rate, health insurance is not used for all illness episodes. Our goal is to identify subjects’ characteristics associated with insurance utilization and the association between utilization and medical expenditure.
A survey was conducted in January and February of 2012. 2093 middle-aged and elderly subjects (45 years old and above) were surveyed.
Health insurance was not utilized for 12.6% (inpatient), 53.3% (outpatient), and 72.6% (self-treatment) of disease episodes. Subjects’ characteristics were associated with insurance utilization. Inpatient and outpatient treatments were expensive. In the multivariate analysis of outpatient treatment expenditure, insurance utilization was significantly associated with higher treatment cost, lost income, and gross total cost.
Utilization of health insurance may need to be improved. Insurance utilization can reduce out-of-pocket medical expenditure. However, the amount paid by the insured is still high. Policy intervention is needed to further improve the effectiveness of health insurance.
PMCID: PMC3855696  PMID: 24324654
22.  Hierarchical Shrinkage Priors and Model Fitting for High-dimensional Generalized Linear Models 
Statistical applications in genetics and molecular biology  2012;11(6):10.1515/1544-6115.1803.
Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions: hierarchical Cauchy and double-exponential distributions, on coefficients in generalized linear models. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (
PMCID: PMC3658361  PMID: 23192052
Adaptive Lasso; Bayesian inference; Generalized linear model; Genetic polymorphisms; Grouped variables; Hierarchical model; High-dimensional data; Shrinkage prior
23.  A Selective Review of Group Selection in High-Dimensional Models 
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study.
PMCID: PMC3810358  PMID: 24174707
Bi-level selection; group LASSO; concave group selection; penalized regression; sparsity; oracle property
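As a rough illustration of the group LASSO mentioned in the abstract above, the sketch below implements the group-wise soft-thresholding step that many group LASSO algorithms rely on. It is not taken from the review: the function name `group_soft_threshold`, the penalty form lam * sum_g sqrt(p_g) * ||beta_g||_2, and the sqrt(p_g) group weights are assumptions made for this example.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Group-wise soft-thresholding (hypothetical helper for illustration).

    Assumes the penalty lam * sum_g sqrt(p_g) * ||beta_g||_2, where p_g is
    the size of group g. Each group is either shrunk toward zero as a block
    or set to zero entirely, which is what produces group-level selection.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(beta)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(beta[idx])
        thresh = lam * np.sqrt(idx.sum())
        # Groups whose norm does not exceed the threshold stay at zero.
        if norm > thresh:
            out[idx] = (1.0 - thresh / norm) * beta[idx]
    return out

# Example: the first group is strong and survives, the second is zeroed out.
shrunk = group_soft_threshold([3.0, 4.0, 0.1, 0.1], [0, 0, 1, 1], lam=1.0)
```

The all-or-nothing behavior within a group is why the plain group LASSO selects groups but not individual variables; bi-level selection methods relax this.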
24.  Nonparametric ROC Based Evaluation for Survival Outcomes 
Statistics in medicine  2012;31(23):2660-2675.
For censored survival outcomes, it can be of great interest to evaluate the predictive power of individual markers or their functions. Compared with alternative evaluation approaches, the time-dependent ROC (receiver operating characteristics) based approaches rely on much weaker assumptions, can be more robust, and hence are preferred. In this article, we examine evaluation of markers’ predictive power using the time-dependent ROC curve and a concordance measure which can be viewed as a weighted area under the time-dependent AUC (area under the ROC curve) profile. This study significantly advances from existing time-dependent ROC studies by developing nonparametric estimators of the summary indexes and, more importantly, rigorously establishing their asymptotic properties. It reinforces the statistical foundation of the time-dependent ROC based evaluation approaches for censored survival outcomes. Numerical studies, including simulations and application to an HIV clinical trial, demonstrate the satisfactory finite-sample performance of the proposed approaches.
PMCID: PMC3743052  PMID: 22987578
time-dependent ROC; concordance measure; inverse-probability-of-censoring weighting; marker evaluation; survival outcomes
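To make the idea of a concordance measure for censored outcomes concrete, here is a minimal sketch of Harrell's C-index. This is a simpler, well-known statistic swapped in for illustration, not the nonparametric estimator developed in the article above, and it does not use inverse-probability-of-censoring weighting. The function name `harrell_c` and the convention that a larger marker value means higher risk are assumptions.

```python
def harrell_c(time, event, marker):
    """Harrell's concordance index (illustrative sketch only).

    time   : observed times (event or censoring)
    event  : 1 if the event was observed, 0 if censored
    marker : risk score; larger is assumed to mean earlier event

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. The pair is concordant when marker[i] > marker[j];
    ties in the marker count as one half.
    """
    n = len(time)
    concordant = 0.0
    comparable = 0
    for i in range(n):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if marker[i] > marker[j]:
                    concordant += 1.0
                elif marker[i] == marker[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Example call with made-up data; the third subject is censored.
c_index = harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A value near 0.5 indicates a marker with no discriminative ability, and values closer to 1 indicate that higher marker values tend to go with earlier events.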
25.  Identification of Breast Cancer Prognosis Markers via Integrative Analysis 
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
PMCID: PMC3389801  PMID: 22773869
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
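The abstract above does not state the penalty formula, so the following is only a sketch under an assumed form: lam * sum_j ||beta_j||_2^gamma with 0 < gamma < 1, where beta_j collects the coefficients of gene j across the datasets. The function name `group_bridge_penalty`, the default gamma = 0.5, and the matrix layout (rows are genes, columns are datasets) are assumptions for illustration.

```python
import numpy as np

def group_bridge_penalty(B, lam, gamma=0.5):
    """Assumed 2-norm group bridge penalty: lam * sum_j ||B[j, :]||_2 ** gamma.

    B is a (genes x datasets) coefficient matrix. Taking the 2-norm across
    a row treats the coefficients of one gene in all datasets as a group,
    so a gene tends to be selected in all datasets or in none, while the
    individual coefficients within a selected row may still differ.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma is assumed to lie strictly between 0 and 1")
    B = np.asarray(B, dtype=float)
    row_norms = np.linalg.norm(B, axis=1)
    return lam * np.sum(row_norms ** gamma)

# Example: two genes, two datasets; the second gene has no effect anywhere.
penalty = group_bridge_penalty([[3.0, 4.0], [0.0, 0.0]], lam=2.0)
```

Under this assumed form, letting the coefficients of a selected gene differ across datasets is one way heterogeneity among studies can be accommodated.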

