Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data.
cancer gene expression study; marker identification; similarity; GEO
In cancer research, profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Cancer is diverse. Examining the similarity and difference in the genetic basis of multiple subtypes of the same cancer can lead to a better understanding of their connections and distinctions. Classic meta-analysis methods analyze each subtype separately and then compare analysis results across subtypes. Integrative analysis methods, in contrast, analyze the raw data on multiple subtypes simultaneously and can outperform meta-analysis methods. In this study, prognosis data on multiple subtypes of the same cancer are analyzed. An AFT (accelerated failure time) model is adopted to describe survival. The genetic basis of multiple subtypes is described using the heterogeneity model, which allows a gene/SNP to be associated with prognosis of some subtypes but not others. A compound penalization method is developed to identify genes that contain important SNPs associated with prognosis. The proposed method has an intuitive formulation and is realized using an iterative algorithm. Asymptotic properties are rigorously established. Simulation shows that the proposed method has satisfactory performance and outperforms a penalization-based meta-analysis method and a regularized thresholding method. An NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements is analyzed. Genes associated with the three major subtypes, namely DLBCL, FL, and CLL/SLL, are identified. The proposed method identifies genes that differ from those identified by the alternatives, have important implications, and show satisfactory prediction performance.
Cancer prognosis; Integrative analysis; Genetic association; Marker identification; Penalization
Studies investigating the relationship between maternal passive smoking and the risk of preterm birth have reached inconsistent conclusions. A birth cohort study that included 10,095 nonsmoking women who delivered a singleton live birth was carried out in Lanzhou, China, between 2010 and 2012. Exposure to passive smoking during pregnancy was associated with an increased risk of very preterm birth (<32 completed weeks of gestation; odds ratio = 1.98, 95% confidence interval: 1.41, 2.76) but not moderate preterm birth (32–36 completed weeks of gestation; odds ratio = 0.98, 95% confidence interval: 0.81, 1.19). Risk of very preterm birth increased with the duration of exposure (P for trend = 0.0014). There was no variability in exposures by trimester. The associations were consistent for both medically indicated and spontaneous preterm births. Overall, our findings support a positive association between passive smoking and the risk of very preterm birth.
birth cohort; China; passive smoking; preterm birth
Non-Hodgkin lymphomas (NHLs) include any kind of lymphoma except Hodgkin’s lymphoma. Mantle cell lymphoma (MCL) is a B-cell NHL and accounts for about 6% of all NHL cases. Its epidemiologic and clinical features, as well as biomarkers, can differ from those of other NHL subtypes. This article first provides a very brief description of MCL’s epidemiology and clinical features. For etiology and prognosis separately, we review clinical, environmental, and molecular risk factors that have been suggested in the literature. Among a large number of potential risk factors, only a few have been independently validated, and their clinical utilization has been limited. More data need to be accumulated and effectively analyzed before clinically useful risk factors can be identified and used for prevention, diagnosis, prediction of prognosis path, and treatment selection.
Mantle cell lymphoma; Risk factors; Etiology; Prognosis
In genomic studies, identifying important gene-environment and gene-gene interactions is a challenging problem. In this study, we adopt the statistical modeling approach, where interactions are represented by product terms in regression models. For the identification of important interactions, we adopt penalization, which has been used in many genomic studies. Straightforward application of penalization does not respect the “main effect, interaction” hierarchical structure. A few recently proposed methods respect this structure by applying constrained penalization. However, they demand very complicated computational algorithms and can only accommodate a small number of genomic measurements. We propose a computationally fast penalization method that can identify important gene-environment and gene-gene interactions and respect a strong hierarchical structure. The method takes a stagewise approach and progressively expands its optimization domain to account for possible hierarchical interactions. It is applicable to multiple data types and models. A coordinate descent method is utilized to produce the entire regularized solution path. Simulation study demonstrates the superior performance of the proposed method. We analyze a lung cancer prognosis study with gene expression measurements and identify important gene-environment interactions.
Gene-environment interactions; Gene-gene interactions; Progressive penalization; Stage-wise regression
Lung cancer rates in Xuanwei are the highest in China. In-home use of smoky coal was associated with lung cancer risk, and the association of smoking and lung cancer risk strengthens after stove improvement. Here, we explored the differential association of tobacco use and lung cancer risk by the intensity, duration, and type of coal used.
Materials and Methods
We conducted a population-based case–control study of 260 male lung cancer cases and 260 age-matched male controls. Odds ratios (OR) and 95% confidence intervals (CI) for tobacco use were calculated by conditional logistic regression.
Use of smoky coal was significantly associated with an increased risk of lung cancer, and tobacco use was weakly and non-significantly associated with lung cancer risk. When the association was assessed by coal use, the cigarette-lung cancer risk association was null in hazardous coal users and elevated in less hazardous smoky coal users and non-smoky coal users. The risk of lung cancer per cigarette per day decreased as annual use of coal increased (>0-3 tons: OR: 1.09; 95% CI: 1.03-1.17; >3 tons: OR: 0.99; 95% CI: 0.95-1.03). Among more hazardous coal users, attenuation occurred even at low levels of usage (>0-3 tons: OR: 1.02; 95% CI: 0.91-1.14; >3 tons: OR: 0.94; 95% CI: 0.97-1.03).
We found evidence that smoky coal attenuated the association between tobacco and lung cancer risk in males who lived in Xuanwei, particularly among users of hazardous coal, for whom even low levels of smoky coal use attenuated the association. Our results suggest that the adverse effects of tobacco may become more apparent as China's population continues to switch to cleaner fuels for the home, underscoring the urgent need for smoking cessation in China and elsewhere.
Coal; tobacco; lung cancer; indoor air pollution; China; global health; epidemiology
In high-throughput studies, an important objective is to identify gene-environment interactions associated with disease outcomes and phenotypes. Many commonly adopted methods assume specific parametric or semiparametric models, which may be subject to model mis-specification. In addition, they usually use significance level as the criterion for selecting important interactions. In this study, we adopt the rank-based estimation, which is much less sensitive to model specification than some of the existing methods and includes several commonly encountered data and models as special cases. Penalization is adopted for the identification of gene-environment interactions. It achieves simultaneous estimation and identification and does not rely on significance level. For computation feasibility, a smoothed rank estimation is further proposed. Simulation shows that under certain scenarios, for example with contaminated or heavy-tailed data, the proposed method can significantly outperform the existing alternatives with more accurate identification. We analyze a lung cancer prognosis study with gene expression measurements under the AFT (accelerated failure time) model. The proposed method identifies interactions different from those using the alternatives. Some of the identified genes have important implications.
Gene-environment interaction; robust rank estimation; penalization; marker identification
In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances from the published ones by introducing the contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance.
Integrative analysis; Contrasted penalization; Marker selection; High-throughput cancer studies
In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods.
cancer diagnosis studies; composite penalization; gene expression; integrative analysis
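The composite penalty described above can be illustrated with a minimal sketch. It assumes the standard MCP definition and treats the composite penalty for one gene as an outer MCP applied to the sum of inner penalties over datasets; the tuning values, the example coefficients, and the exact scaling of the inner penalties are hypothetical choices for illustration, not taken from the article. Only the heterogeneity-model case (Lasso or MCP inner penalty) is shown.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty at |t| (assumes lam >= 0 and gamma > 0)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # the penalty is flat beyond gamma * lam


def composite_penalty(coefs, inner, lam_outer, gamma_outer):
    """Outer MCP applied to the summed inner penalties of one gene.

    coefs holds the gene's regression coefficients, one per dataset.
    inner is a function mapping a single coefficient to its inner penalty.
    """
    return mcp(sum(inner(b) for b in coefs), lam_outer, gamma_outer)


# Hypothetical coefficients of one gene in three datasets.
coefs = [0.5, 0.0, -0.2]


def lasso_inner(b):
    return 0.3 * abs(b)       # Lasso inner penalty (illustrative lambda)


def mcp_inner(b):
    return mcp(b, 0.3, 3.0)   # MCP inner penalty (illustrative lambda, gamma)


print(composite_penalty(coefs, lasso_inner, lam_outer=1.0, gamma_outer=3.0))
print(composite_penalty(coefs, mcp_inner, lam_outer=1.0, gamma_outer=3.0))
```

Because the outer MCP levels off, a gene with large effects is not penalized further, while the inner penalty decides in which datasets the gene is retained.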
Illness and the medical expenditure that follows have a profound impact on the well-being of individuals and households. China is a huge country with significant regional differences. The goal of this study is to investigate the associations of illness and medical expenditure with other categories of household expenditures, with special attention paid to the differences in observations between the western and eastern regions.
A survey was conducted in six major cities in China, three in the east and three in the west, in 2011. Data on demographics, illness conditions, and medical and other expenditures were collected from 12,515 households.
In the analysis of the associations of illness conditions and medical expenditure with demographics, multiple significant associations were observed, and there are differences between the eastern and western regions. In univariate analyses, illness conditions and medical expenditure were found to have significant associations with other categories of expenditures. In multivariate analyses adjusting for household and household head characteristics, few associations were observed, and there exist differences between the regions.
This study has provided empirical evidence on the associations of illness/medical expenditure with demographics and with other categories of expenditures. Differences across regions were observed in multiple aspects. The reasons underlying such differences are worth investigating further.
Illness condition; Medical expenditure; Household expenditure; Cross-region difference; China
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
High-dimensional survival data; Rank independence screening; Sure screening property
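The idea of rank-based marginal screening can be sketched as follows. This is a simplified illustration only: it uses Kendall's tau between each covariate and a fully observed outcome and ignores censoring, which the proposed method is designed to handle. The function names and the toy data are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(x, y):
    """Kendall's tau-a: mean concordance over all pairs of observations."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return total / (n * (n - 1) / 2)


def rank_screen(columns, y, keep):
    """Rank covariates by |tau| with the outcome and keep the top ones."""
    scores = [abs(kendall_tau(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: -scores[j])
    return order[:keep], scores


# Toy data: covariate 0 is monotone in y but contains an outlier (100).
y = [1, 2, 3, 4, 5]
columns = [
    [2, 4, 6, 8, 100],
    [3, 1, 4, 1, 5],
    [5, 4, 3, 2, 1],
]
selected, scores = rank_screen(columns, y, keep=2)
print(selected, scores)
```

Because only the ordering of values enters the statistic, the outlier in the first covariate does not change its score, which is the robustness property the abstract refers to.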
Nasopharyngeal carcinoma (NPC) is a malignant neoplasm arising from the mucosal epithelium of the nasopharynx. Different races can have different etiology, presentation, and progression patterns.
Data were analyzed on NPC patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1973 and 2009. Racial groups studied included non-Hispanic whites, Hispanic whites, blacks, Asians, and others. Patient characteristics, age-adjusted incidence and mortality rates, treatment, and five-year relative survival rates were compared across races. Stratification by stage at diagnosis and histologic type was considered. Multivariate regression was conducted to evaluate the significance of racial differences.
Patient characteristics that were significantly different across races included age at diagnosis, histologic type, in situ/malignant tumors in lifetime, stage, grade, and regional nodes positive. Incidence and mortality rates were significantly different across races, with Asians having the highest rates overall and stratified by age and/or histologic type. Asians also had the highest rate of receiving radiation only. The racial differences in treatment were significant in the multivariate stratified analysis. When stratified by stage and histologic type, Asians had the best five-year survival rates. The survival experience of other races depended on stage and type. In the multivariate analysis, the racial differences were significant.
Analysis of the SEER data shows that racial differences exist among NPC patients in the U.S. This result can be informative to cancer epidemiologists and clinicians.
nasopharyngeal carcinoma; racial differences; SEER
MCL (mantle cell lymphoma) is a rare subtype of NHL (non-Hodgkin lymphoma) with mostly poor prognosis. Different races have different etiology, presentation, and progression patterns.
Data were analyzed on MCL patients in the United States reported to the SEER (Surveillance, Epidemiology, and End Results) database between 1992 and 2009. SEER contains the most comprehensive population-based cancer information in the U.S., covering approximately 28% of the population. Racial groups analyzed included non-Hispanic whites, Hispanic whites, blacks, and Asians/PIs (Pacific Islanders). Patient characteristics, age-adjusted incidence rate, and survival rate were compared across races. Stratification by age, gender, and stage at diagnosis was considered. Multivariate analysis was conducted on survival.
In the analysis of patients’ characteristics, distributions of gender, marital status, age at diagnosis, stage, and extranodal involvement were significantly different across races. For all three age groups and for both males and females, non-Hispanic whites have the highest incidence rates. In the analysis of survival, for cancers diagnosed in the period of 1992–2004, no significant racial difference is observed. For cancers diagnosed in the period of 1999–2004, significant racial differences exist for the 40–64 age group and stage III and IV cancers.
Racial differences exist among MCL patients in the U.S. in terms of patients’ characteristics, incidence, and survival. More extended data collection and analysis are needed to more comprehensively describe and understand the racial differences.
Mantle cell lymphoma; Racial differences; SEER; Non-hodgkin lymphoma
High-throughput cancer studies have been extensively conducted, searching for genetic markers associated with outcomes beyond clinical and environmental risk factors. Gene–environment interactions can have important implications beyond main effects. The commonly-adopted single-marker analysis cannot accommodate the joint effects of a large number of markers. The existing joint-effects methods also have limitations. Specifically, they may suffer from high computational cost, do not respect the “main effect, interaction” hierarchical structure, or use ineffective techniques. We develop a penalization method for the identification of important G × E interactions and main effects. It has an intuitive formulation, respects the hierarchical structure, accommodates the joint effects of multiple markers, and is computationally affordable. In numerical study, we analyze prognosis data under the AFT (accelerated failure time) model. Simulation shows satisfactory performance of the proposed method. Analysis of an NHL (non-Hodgkin lymphoma) study with SNP measurements shows that the proposed method identifies markers with important implications and satisfactory prediction performance.
Gene–environment interaction; Penalized marker identification; Cancer prognosis
Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as LASSO do not take into account the possible linkage disequilibrium between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates linkage disequilibrium (LD) information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method. This algorithm is efficient in dealing with a large number of SNPs. A multi-split method is used to calculate the p-values of the selected SNPs for assessing their significance. We refer to the proposed penalty function as the smoothed MCP and the proposed approach as the SMCP method. The performance of the proposed SMCP method and its comparison with the LASSO and MCP approaches are evaluated through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using heterogeneous stock mice data and rheumatoid arthritis data.
Genetic association; Feature selection; Linkage disequilibrium; Penalized regression; Single nucleotide polymorphism
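A minimal sketch of a smoothed MCP penalty is given below. It assumes the penalty is the sum of MCP terms over SNPs plus a quadratic term on the difference in absolute effects at adjacent SNPs, weighted by a measure of LD between them. The use of absolute values, the factor of one half, and all parameter names and values are assumptions made for illustration.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty at |t| (assumes lam >= 0 and gamma > 0)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def smcp_penalty(beta, ld_weights, lam1, lam2, gamma=3.0):
    """MCP sparsity term plus an LD-weighted smoothing term.

    beta       : effects of SNPs ordered by position
    ld_weights : ld_weights[j] measures LD between SNP j and SNP j + 1
    """
    if len(ld_weights) != len(beta) - 1:
        raise ValueError("need one LD weight per adjacent SNP pair")
    sparsity = sum(mcp(b, lam1, gamma) for b in beta)
    smooth = sum(
        w * (abs(beta[j]) - abs(beta[j + 1])) ** 2
        for j, w in enumerate(ld_weights)
    )
    return sparsity + 0.5 * lam2 * smooth


# Hypothetical effects at four adjacent SNPs and LD between neighbours.
beta = [0.0, 0.4, -0.4, 0.0]
ld = [0.1, 0.9, 0.1]
print(smcp_penalty(beta, ld, lam1=0.2, lam2=1.0))
```

In this form, two SNPs in strong LD with effects of equal magnitude incur no smoothing penalty even when their signs differ, which accommodates opposite allele coding.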
In the analysis of cancer studies with high-dimensional genomic measurements, integrative analysis provides an effective way of pooling information across multiple heterogeneous datasets. The genomic basis of multiple independent datasets, which can be characterized by the sets of genomic markers, can be described using the homogeneity model or heterogeneity model. Under the homogeneity model, all datasets share the same set of markers associated with responses. In contrast, under the heterogeneity model, different studies have overlapping but possibly different sets of markers. The heterogeneity model contains the homogeneity model as a special case and can be much more flexible. Marker selection under the heterogeneity model calls for bi-level selection to determine whether a covariate is associated with response in any study at all as well as in which studies it is associated with responses. In this study, we consider two minimax concave penalty (MCP) based penalization approaches for marker selection under the heterogeneity model. For each approach, we describe its rationale and an effective computational algorithm. We conduct simulation to investigate their performance and compare with the existing alternatives. We also apply the proposed approaches to the analysis of gene expression data on multiple cancers.
Integrative analysis; Heterogeneity model; Marker selection
Impaired function of Janus kinase/signal transducer and activator of transcription (JAK/STAT) signaling pathway genes leads to immunodeficiency and various hematopoietic disorders. We evaluated the association between single nucleotide polymorphisms (SNPs) in 12 JAK/STAT pathway genes (JAK3, STAT1, STAT2, STAT3, STAT4, STAT5a, STAT5b, STAT6, SOCS1, SOCS2, SOCS3, and SOCS4) and NHL risk in a population-based case-control study of Connecticut women. We identified three SNPs in STAT3 (rs12949918 and rs6503695) and STAT4 (rs932169) associated with NHL risk after adjustment for multiple comparisons. Our results suggest that genetic variation in JAK/STAT pathway genes may play a role in lymphomagenesis and warrants further investigation.
JAK/STAT signaling pathway; Non-Hodgkin Lymphoma; polymorphism; case-control study
To develop a reference of population-based gestational age-specific birth weight percentiles for contemporary Chinese.
Birth weight data were collected by the China National Population-based Birth Defects Surveillance System. A total of 1,105,214 live singleton births aged ≥28 weeks of gestation without birth defects during 2006–2010 were included. The lambda-mu-sigma method was utilized to generate percentiles and curves.
Gestational age-specific birth weight percentiles for male and female infants were constructed separately. Significant differences were observed between the current reference and other references developed for Chinese or non-Chinese infants.
There have been moderate increases in birth weight percentiles for Chinese infants of both sexes and most gestational ages since the 1980s, suggesting the importance of utilizing an updated national reference for both clinical and research purposes.
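The lambda-mu-sigma (LMS) method summarizes the birth weight distribution at each gestational age by a Box-Cox power L, a median M, and a coefficient of variation S, from which any percentile follows. The sketch below applies the standard LMS formulas; the L, M, and S values are hypothetical and are not taken from the reference described above.

```python
import math
from statistics import NormalDist


def lms_centile(L, M, S, percentile):
    """Measurement at a given percentile under the LMS model."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    if abs(L) < 1e-12:
        return M * math.exp(S * z)           # limiting case L = 0
    return M * (1.0 + L * S * z) ** (1.0 / L)


def lms_zscore(L, M, S, y):
    """Z-score of an observed measurement y under the LMS model."""
    if abs(L) < 1e-12:
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)


# Hypothetical LMS parameters for one sex at one gestational age (grams).
L, M, S = 0.5, 3300.0, 0.12
for p in (10, 50, 90):
    print(p, round(lms_centile(L, M, S, p)))
```

The two functions are inverses of each other, so the same fitted curves can be used both to draw percentile charts and to classify an individual infant.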
In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach.
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
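Coordinate descent algorithms for MCP-type penalties rest on a closed-form univariate update. The sketch below shows that update for a single standardized covariate under a plain MCP, assuming gamma > 1. It is a building block only: the sparse group MCP and the group coordinate descent algorithm mentioned above involve additional group-level steps that are not shown.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by Lasso-type updates."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_update(z, lam, gamma):
    """Minimizer over b of 0.5 * (z - b)**2 + MCP(b; lam, gamma).

    z is the univariate least-squares solution for a standardized
    covariate; gamma must exceed 1 for the minimizer to be unique.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large effects are left unshrunk


for z in (0.5, 2.0, 5.0):
    print(z, mcp_update(z, lam=1.0, gamma=3.0))
```

Small inputs are set exactly to zero, which performs selection, while large inputs pass through unchanged, which avoids the bias that the Lasso introduces for strong effects.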
To study the association between serum C-reactive protein (CRP) and urinary albumin excretion in the Multi-Ethnic Study of Atherosclerosis and to assess whether the association is modified by ethnicity, sex, or systolic blood pressure.
This was a cross-sectional study of 6675 participants who were free from macroalbuminuria and clinical cardiovascular disease (mean age 62.1 years, 53% female; 39% White, 27% African American, 22% Hispanic, and 12% Chinese). Urinary albumin excretion was measured by spot urine albumin-to-creatinine ratio (ACR). Effect modifications were tested after adjusting for age, diabetes, body mass index, smoking, use of angiotensin-converting enzyme inhibitor or angiotensin-receptor blocker, other antihypertensive drugs, estrogens, statins, and high-density lipoprotein cholesterol and triglyceride levels.
The association between CRP and ACR was modified by ethnicity (P=.01) and sex (P<.001), but not by systolic blood pressure. After multivariate adjustment, the association remained in Chinese, African American, and Hispanic men and African American women (P<.02 for African American men, and P<.04 for the other subgroups).
The association between CRP and ACR was modified by ethnicity and sex; it was stronger in non-White men and African American women. These interactions have not been reported before, and future studies should consider them.
Albuminuria; C-Reactive Protein; Ethnicity; Gender
We conducted a population-based case-control study in Connecticut women to test the hypothesis that genetic variations in DNA repair pathway genes may modify the relationship between body mass index (BMI) and risk of non-Hodgkin lymphoma (NHL). Compared to those with BMI < 25, women with BMI ≥ 25 had significantly increased risk of NHL among women who carried BRCA1 (rs799917) CT/TT, ERCC2 (rs13181) AA, XRCC1 (rs1799782) CC, and WRN (rs1801195) GG genotypes, but no increase in NHL risk among women who carried BRCA1 CC, ERCC2 AC/CC, XRCC1 CT/TT, and WRN GT/TT genotypes. A significant interaction with BMI was only observed for WRN (rs1801195, P=0.004) for T-cell lymphoma and ERCC2 (rs13181, P=0.002) for diffuse large B-cell lymphoma. The results suggest that common genetic variation in DNA repair pathway genes may modify the association between BMI and NHL risk.
Non-Hodgkin lymphoma; BMI; polymorphisms; DNA repair genes
We consider analysis of Genetic Analysis Workshop 18 data, which involves multiple longitudinal traits and dense genome-wide single-nucleotide polymorphism (SNP) markers. We use a multivariate linear mixed model to account for the covariance of random effects and multivariate residuals. We divide the SNPs into groups according to the genes they belong to and score them using weighted sum statistics. We propose a penalized approach for genetic variant selection at the gene level. The overall modeling and penalized selection method is referred to as the penalized multivariate linear mixed model. Cross-validation is used for tuning parameter selection. A resampling approach is adopted to evaluate the relative stability of the identified genes. Application to the Genetic Analysis Workshop 18 data shows that the proposed approach can effectively select markers associated with phenotypes at gene level.
Thyroid cancers have increased dramatically over the past few decades. Comorbidities may be important, and previous studies have indicated elevated second cancer risk after initial primary thyroid cancers. This study examined the risk of second cancers after development of a thyroid cancer, primarily utilizing the Surveillance, Epidemiology, and End Results (SEER) program database.
The cohort consisted of men and women diagnosed with first primary thyroid cancer who were reported to a SEER database in 1973–2008 (n=52,103). Standardized incidence ratios (SIR) were calculated for all secondary cancers. Confidence intervals and p-values are two-sided at the 0.05 significance level and are based on exact Poisson methods.
In this cohort, 4457 individuals developed second cancers. The risk of developing second cancers after a primary thyroid cancer varied from 10% to 150% depending on different cancer types. Cancers in all sites, breast, skin, prostate, kidney, brain, salivary gland, second thyroid, lymphoma, myeloma, and leukemia were elevated. The magnitude of the risk varied by histology, tumor size, calendar year of first primary thyroid cancer diagnosis, and the treatment of the primary thyroid cancer. The risk of a second cancer was elevated in patients whose first primary thyroid carcinomas were small, or were diagnosed after 1994, or in whom some form of radiation treatment was administered.
This large population-based analysis of second cancers among thyroid cancer patients suggests that there was an increase of second cancers in all sites, and the most commonly elevated second cancers were those of the salivary gland and kidney. Additionally, the increase in second cancers in patients with recently diagnosed thyroid microcarcinomas (<10 mm) suggests that aggressive radiation treatment of the first primary thyroid cancer, the environment, and genetic susceptibility may increase the risk of a second cancer.
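A standardized incidence ratio is the observed number of second cancers divided by the number expected from reference rates, and an exact Poisson interval for the observed count gives its confidence limits. The sketch below finds those limits by bisection on the Poisson distribution function using only the standard library. The counts in the example are hypothetical and are not taken from the study.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), with terms computed on the log scale."""
    if k < 0:
        return 0.0
    terms = (
        math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))
        for i in range(k + 1)
    )
    return min(1.0, sum(terms))


def _root_of_decreasing(f, lo, hi, iters=200):
    """Bisection for the root of a function that decreases from + to -."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_poisson_ci(observed, alpha=0.05):
    """Two-sided exact confidence limits for a Poisson mean."""
    hi = observed + 10.0 * math.sqrt(observed + 1.0) + 10.0
    if observed == 0:
        lower = 0.0
    else:
        # Lower limit solves P(X >= observed | mu) = alpha / 2.
        lower = _root_of_decreasing(
            lambda m: poisson_cdf(observed - 1, m) - (1.0 - alpha / 2.0), 0.0, hi
        )
    # Upper limit solves P(X <= observed | mu) = alpha / 2.
    upper = _root_of_decreasing(
        lambda m: poisson_cdf(observed, m) - alpha / 2.0, 0.0, hi
    )
    return lower, upper


def sir_with_ci(observed, expected, alpha=0.05):
    """Standardized incidence ratio with exact Poisson confidence limits."""
    lower, upper = exact_poisson_ci(observed, alpha)
    return observed / expected, lower / expected, upper / expected


# Hypothetical example: 10 second cancers observed, 5.0 expected.
print(sir_with_ci(10, 5.0))
```

The summation in `poisson_cdf` is adequate for illustration; for very large observed counts, a chi-square quantile routine from a statistics library would be faster.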
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smoothes regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under the linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorization–minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
Group selection; Regularization; SNP; Smoothing
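Group coordinate descent updates one group of coefficients at a time. For a plain group Lasso with orthonormalized covariates within each group, that update is the group soft-thresholding operator sketched below. The quadratic smoothing term of SGL would modify this update and is not included here; the function name and example values are hypothetical.

```python
import math


def group_soft_threshold(z, lam):
    """Shrink the vector z toward zero by lam in Euclidean norm.

    Returns the zero vector when ||z|| <= lam, which removes the
    whole group from the model at once.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1.0 - lam / norm
    return [scale * v for v in z]


# Hypothetical group-wise least-squares solution for a two-SNP group.
print(group_soft_threshold([3.0, 4.0], lam=1.0))
print(group_soft_threshold([3.0, 4.0], lam=5.0))
```

Because the shrinkage factor is shared by all members of a group, the SNPs in a group enter or leave the model together, which is the group sparsity the abstract describes.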