Related Articles

1.  Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions 
BMC Proceedings  2012;6(Suppl 2):S10.
Genomic selection (GS) is emerging as an efficient and cost-effective method for estimating breeding values using molecular markers distributed over the entire genome. In essence, it involves estimating the simultaneous effects of all genes or chromosomal segments and combining the estimates to predict the total genomic breeding value (GEBV). Accurate prediction of GEBVs is a central and recurring challenge in plant and animal breeding. The existence of a bewildering array of approaches for predicting breeding values using markers underscores the importance of identifying approaches able to efficiently and accurately predict breeding values. Here, we comparatively evaluate the predictive performance of six regularized linear regression methods (ridge regression, ridge regression BLUP, lasso, adaptive lasso, elastic net and adaptive elastic net) for predicting GEBV using dense SNP markers.
We predicted GEBVs for a quantitative trait using a dataset on 3000 progenies of 20 sires and 200 dams and an accompanying genome consisting of five chromosomes with 9990 biallelic SNP-marker loci simulated for the QTL-MAS 2011 workshop. We applied all the six methods that use penalty-based (regularization) shrinkage to handle datasets with far more predictors than observations. The lasso, elastic net and their adaptive extensions further possess the desirable property that they simultaneously select relevant predictive markers and optimally estimate their effects. The regression models were trained with a subset of 2000 phenotyped and genotyped individuals and used to predict GEBVs for the remaining 1000 progenies without phenotypes. Predictive accuracy was assessed using the root mean squared error, the Pearson correlation between predicted GEBVs and (1) the true genomic value (TGV), (2) the true breeding value (TBV) and (3) the simulated phenotypic values based on fivefold cross-validation (CV).
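The penalty-based shrinkage described above can be illustrated with a minimal coordinate-descent sketch. It assumes one common parametrization of the elastic net objective, (1/2n)·||y − Xβ||² + λ·[α·||β||₁ + (1 − α)/2·||β||²], which is not necessarily the one used in the paper; α = 1 gives the lasso and α = 0 gives ridge regression. The function names and the toy data are hypothetical, and the columns of X are assumed to be centered and non-constant so that no intercept is needed.

```python
def soft_threshold(rho, t):
    """Soft-thresholding operator: shrinks rho toward zero by t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for the elastic net (no intercept).

    Assumed objective (hypothetical parametrization):
    (1/2n)*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||^2)
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta; beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            # Correlation of column j with the partial residual
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for v, r in zip(col, resid)]
            beta[j] = new
    return beta


# Hypothetical toy data: two orthogonal, centered "markers";
# the second has only a very small true effect.
x1 = [1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 0, 0]
X = [[a, b] for a, b in zip(x1, x2)]
y = [2 * a + 0.1 * b for a, b in zip(x1, x2)]

lasso = elastic_net_cd(X, y, lam=0.2, alpha=1.0)
ridge = elastic_net_cd(X, y, lam=0.2, alpha=0.0)
print("lasso:", lasso)
print("ridge:", ridge)
```

On this toy input the lasso sets the weak marker's coefficient exactly to zero, while ridge only shrinks it. This is the selection property the abstract attributes to the lasso-type methods.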
The elastic net, lasso, adaptive lasso and the adaptive elastic net all had similar accuracies but outperformed ridge regression and ridge regression BLUP in terms of the Pearson correlation between predicted GEBVs and the true genomic value as well as the root mean squared error. The performance of RR-BLUP was also somewhat better than that of ridge regression. This pattern was replicated by the Pearson correlation between predicted GEBVs and the true breeding values (TBV) and the root mean squared error calculated with respect to TBV, except that accuracy was lower for all models, most especially for the adaptive elastic net. The correlation between the predicted GEBV and simulated phenotypic values based on the fivefold CV also revealed a similar pattern except that the adaptive elastic net had lower accuracy than both the ridge regression methods.
All the six models had relatively high prediction accuracies for the simulated data set. Accuracy was higher for the lasso type methods than for ridge regression and ridge regression BLUP.
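The two accuracy measures named in this abstract, the root mean squared error and the Pearson correlation between predicted and reference values, can be sketched as follows. The function names and the example numbers are hypothetical and are not taken from the QTL-MAS 2011 data.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical predicted GEBVs and reference values for five individuals
gebv = [0.8, -0.2, 1.5, 0.1, -1.0]
reference = [1.0, -0.5, 1.2, 0.3, -0.9]
print("RMSE:", rmse(gebv, reference))
print("Pearson r:", pearson(gebv, reference))
```

In a fivefold cross-validation, these functions would be applied to the held-out fold each time and the results averaged across folds.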
PMCID: PMC3363152  PMID: 22640436
2.  Significance testing in ridge regression for genetic data 
BMC Bioinformatics  2011;12:372.
Technological developments have increased the feasibility of large scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient.
We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition.
The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
PMCID: PMC3228544  PMID: 21929786
3.  The Stream Algorithm: Computationally Efficient Ridge-Regression via Bayesian Model Averaging, and Applications to Pharmacogenomic Prediction of Cancer Cell Line Sensitivity 
Computational efficiency is important for learning algorithms operating in the “large p, small n” setting. In computational biology, the analysis of data sets containing tens of thousands of features (“large p”), but only a few hundred samples (“small n”), is nowadays routine, and regularized regression approaches such as ridge-regression, lasso, and elastic-net are popular choices. In this paper we propose a novel and highly efficient Bayesian inference method for fitting ridge-regression. Our method is fully analytical, and bypasses the need for expensive tuning parameter optimization, via cross-validation, by employing Bayesian model averaging over the grid of tuning parameters. Additional computational efficiency is achieved by adopting the singular value decomposition re-parametrization of the ridge-regression model, replacing computationally expensive inversions of large p × p matrices by efficient inversions of small and diagonal n × n matrices. We show in simulation studies and in the analysis of two large cancer cell line data panels that our algorithm achieves slightly better predictive performance than cross-validated ridge-regression while requiring only a fraction of the computation time. Furthermore, in comparisons based on the cell line data sets, our algorithm systematically out-performs the lasso in both predictive performance and computation time, and shows equivalent predictive performance, but considerably smaller computation time, than the elastic-net.
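The singular value decomposition re-parametrization mentioned above can be written out for the standard ridge estimator. This is the textbook identity only, not the paper's Bayesian model averaging procedure; the symbols (X for the n × p feature matrix, y for the response, λ for the tuning parameter, r for the rank of X) are assumptions introduced here.

```latex
% Thin SVD of the n x p feature matrix, with rank r <= n
X = U D V^{\top}, \qquad
U \in \mathbb{R}^{n \times r},\quad
D = \operatorname{diag}(d_1, \dots, d_r),\quad
V \in \mathbb{R}^{p \times r}

% Ridge estimator: the p x p inverse collapses to a diagonal r x r one
\hat{\beta}(\lambda)
  = (X^{\top} X + \lambda I_p)^{-1} X^{\top} y
  = V (D^{2} + \lambda I_r)^{-1} D\, U^{\top} y
  = \sum_{i=1}^{r} \frac{d_i}{d_i^{2} + \lambda}\, v_i\, u_i^{\top} y
```

Because only the diagonal factor depends on λ, the decomposition can be computed once and reused across a whole grid of tuning parameters.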
PMCID: PMC3911888  PMID: 24297531
ridge-regression; Bayesian model averaging; predictive modeling; machine learning; cancer cell lines; pharmacogenomic screens
4.  Cancer Screening: A Mathematical Model Relating Secreted Blood Biomarker Levels to Tumor Sizes  
PLoS Medicine  2008;5(8):e170.
Increasing efforts and financial resources are being invested in early cancer detection research. Blood assays detecting tumor biomarkers promise noninvasive and financially reasonable screening for early cancer with high potential of positive impact on patients' survival and quality of life. For novel tumor biomarkers, the actual tumor detection limits are usually unknown and there have been no studies exploring the tumor burden detection limits of blood tumor biomarkers using mathematical models. Therefore, the purpose of this study was to develop a mathematical model relating blood biomarker levels to tumor burden.
Methods and Findings
Using a linear one-compartment model, the steady state between tumor biomarker secretion into and removal out of the intravascular space was calculated. Two conditions were assumed: (1) the compartment (plasma) is well-mixed and kinetically homogenous; (2) the tumor biomarker consists of a protein that is secreted by tumor cells into the extracellular fluid compartment, and a certain percentage of the secreted protein enters the intravascular space at a continuous rate. The model was applied to two pathophysiologic conditions: tumor biomarker is secreted (1) exclusively by the tumor cells or (2) by both tumor cells and healthy normal cells. To test the model, a sensitivity analysis was performed assuming variable conditions of the model parameters. The model parameters were primed on the basis of literature data for two established and well-studied tumor biomarkers (CA125 and prostate-specific antigen [PSA]). Assuming biomarker secretion by tumor cells only and 10% of the secreted tumor biomarker reaching the plasma, the calculated minimally detectable tumor sizes ranged between 0.11 mm³ and 3,610.14 mm³ for CA125 and between 0.21 mm³ and 131.51 mm³ for PSA. When biomarker secretion by healthy cells and tumor cells was assumed, the calculated tumor sizes leading to positive test results ranged between 116.7 mm³ and 1.52 × 10⁶ mm³ for CA125 and between 27 mm³ and 3.45 × 10⁵ mm³ for PSA. One of the limitations of the study is the absence of quantitative data available in the literature on the secreted tumor biomarker amount per cancer cell in intact whole body animal tumor models or in cancer patients. Additionally, the fraction of secreted tumor biomarkers actually reaching the plasma is unknown. Therefore, we used data from published cell culture experiments to estimate tumor cell biomarker secretion rates and assumed a wide range of secretion rates to account for their potential changes due to field effects of the tumor environment.
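The steady-state reasoning above can be sketched for the tumor-only case. This is a generic one-compartment balance, not the authors' exact equations: it assumes first-order elimination with rate constant ln 2 / half-life, and every parameter name and numeric value below is a hypothetical placeholder rather than a value from the study.

```python
import math


def steady_state_concentration(n_cells, secretion_per_cell,
                               fraction_to_plasma, half_life, plasma_volume):
    """Plasma concentration at which influx equals first-order removal.

    Assumed balance: fraction * secretion * n_cells = k_el * C * V,
    with k_el = ln(2) / half_life.
    """
    k_el = math.log(2) / half_life
    influx = fraction_to_plasma * secretion_per_cell * n_cells
    return influx / (k_el * plasma_volume)


def min_detectable_cells(cutoff, secretion_per_cell,
                         fraction_to_plasma, half_life, plasma_volume):
    """Number of tumor cells whose steady-state level equals the assay cutoff."""
    k_el = math.log(2) / half_life
    return cutoff * plasma_volume * k_el / (fraction_to_plasma * secretion_per_cell)


# Hypothetical placeholder parameters (units: amount/cell/day, days, litres)
params = dict(secretion_per_cell=1e-6, fraction_to_plasma=0.10,
              half_life=2.0, plasma_volume=3.0)
n_min = min_detectable_cells(cutoff=35.0, **params)
print("cells needed to reach the cutoff:", n_min)
```

Dividing the resulting cell count by an assumed tumor cell density would convert it to a volume. Under this balance the detectable size scales inversely with both the secretion rate and the fraction reaching the plasma, which is why the abstract reports such wide ranges.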
This study introduced a linear one-compartment mathematical model that allows estimation of minimal detectable tumor sizes based on blood tumor biomarker assays. Assuming physiological data on CA125 and PSA from the literature, the model predicted detection limits of tumors that were in qualitative agreement with the actual clinical performance of both biomarkers. The model may be helpful in future estimation of minimal detectable tumor sizes for novel proteomic biomarker assays if sufficient physiologic data for the biomarker are available. The model may address the potential and limitations of tumor biomarkers, help prioritize biomarkers, and guide investments into early cancer detection research efforts.
Sanjiv Gambhir and colleagues describe a linear one-compartment mathematical model that allows estimation of minimal detectable tumor sizes based on blood tumor biomarker assays.
Editors' Summary
Cancers—disorganized masses of cells that can occur in any tissue—develop when cells acquire genetic changes that allow them to grow uncontrollably and to spread around the body (metastasize). If a cancer (tumor) is detected when it is small, surgery can often provide a cure. Unfortunately, many cancers (particularly those deep inside the body) are not detected until they are large enough to cause pain or other symptoms by pressing against surrounding tissue. By this time, it may be impossible to remove the original tumor surgically and there may be metastases scattered around the body. In such cases, radiotherapy and chemotherapy can sometimes help, but the outlook for patients whose cancers are detected late is often poor. Consequently, researchers are trying to develop early detection tests for different types of cancer. Many tumors release specific proteins—“cancer biomarkers”—into the blood and the hope is that it might be possible to find sets of blood biomarkers that detect cancers when they are still small and thus save many lives.
Why Was This Study Done?
For most biomarkers, it is not known how the amount of protein detected in the blood relates to tumor size or how sensitive the assays for biomarkers must be to improve patient survival. In this study, the researchers develop a “linear one-compartment” mathematical model to predict how large tumors need to be before blood biomarkers can be used to detect them and test this model using published data on two established cancer biomarkers—CA125 and prostate-specific antigen (PSA). CA125 is used to monitor the progress of patients with ovarian cancer after treatment; ovarian cancer is rarely diagnosed in its early stages and only one-fourth of women with advanced disease survive for 5 y after diagnosis. PSA is used to screen for prostate cancer and has increased the detection of this cancer in its early stages when it is curable.
What Did the Researchers Do and Find?
To develop a model that relates secreted blood biomarker levels to tumor sizes, the researchers assumed that biomarkers mix evenly throughout the patient's blood, that cancer cells secrete biomarkers into the fluid that surrounds them, that 0.1%–20% of these secreted proteins enter the blood at a continuous rate, and that biomarkers are continuously removed from the blood. The researchers then used their model to calculate the smallest tumor sizes that might be detectable with these biomarkers by feeding in existing data on CA125 and on PSA, including assay detection limits and the biomarker secretion rates of cancer cells growing in dishes. When only tumor cells secreted the biomarker and 10% of the secreted biomarker reaches the blood, the model predicted that ovarian tumors between 0.11 mm³ (smaller than a grain of salt) and nearly 4,000 mm³ (about the size of a cherry) would be detectable by measuring CA125 blood levels (the range was determined by varying the amount of biomarker secreted by the tumor cells and the assay sensitivity); for prostate cancer, the detectable tumor sizes ranged from a similar lower size to about 130 mm³ (pea-sized). However, healthy cells often also secrete small quantities of cancer biomarkers. With this condition incorporated into the model, the estimated detectable tumor sizes (or total tumor burden including metastases) ranged between grape-sized and melon-sized for ovarian cancers and between pea-sized and about grapefruit-sized for prostate cancers.
What Do These Findings Mean?
The accuracy of the calculated tumor sizes provided by the researchers' mathematical model is limited by the lack of data on how tumors behave in the human body and by the many assumptions incorporated into the model. Nevertheless, the model predicts detection limits for ovarian and prostate cancer that broadly mirror the clinical performance of both biomarkers. Somewhat worryingly, the model also indicates that a tumor may have to be very large for blood biomarkers to reveal its presence, a result that could limit the clinical usefulness of biomarkers, especially if they are secreted not only by tumor cells but also by healthy cells. Given this finding, as more information about how biomarkers behave in the human body becomes available, this model (and more complex versions of it) should help researchers decide which biomarkers are likely to improve early cancer detection and patient outcomes.
Additional Information
Please access these Web sites via the online version of this summary at
The US National Cancer Institute provides a brief description of what cancer is and how it develops and a fact sheet on tumor markers; it also provides information on all aspects of ovarian and prostate cancer for patients and professionals, including information on screening and testing (in English and Spanish)
The UK charity Cancerbackup also provides general information about cancer and more specific information about ovarian and prostate cancer, including the use of CA125 and PSA for screening and follow-up
The American Society of Clinical Oncology offers a wide range of information on various cancer types, including online published articles on the current status of cancer diagnosis and management from the educational book developed by the annual meeting faculty and presenters. Registration is mandatory, but information is free
PMCID: PMC2517618  PMID: 18715113
5.  Biomarker Profiling by Nuclear Magnetic Resonance Spectroscopy for the Prediction of All-Cause Mortality: An Observational Study of 17,345 Persons 
PLoS Medicine  2014;11(2):e1001606.
In this study, Würtz and colleagues conducted high-throughput profiling of blood specimens in two large population-based cohorts in order to identify biomarkers for all-cause mortality and enhance risk prediction. The authors found that biomarker profiling improved prediction of the short-term risk of death from all causes above established risk factors. However, further investigations are needed to clarify the biological mechanisms and the utility of these biomarkers to guide screening and prevention.
Please see later in the article for the Editors' Summary
Early identification of ambulatory persons at high short-term risk of death could benefit targeted prevention. To identify biomarkers for all-cause mortality and enhance risk prediction, we conducted high-throughput profiling of blood specimens in two large population-based cohorts.
Methods and Findings
106 candidate biomarkers were quantified by nuclear magnetic resonance spectroscopy of non-fasting plasma samples from a random subset of the Estonian Biobank (n = 9,842; age range 18–103 y; 508 deaths during a median of 5.4 y of follow-up). Biomarkers for all-cause mortality were examined using stepwise proportional hazards models. Significant biomarkers were validated and incremental predictive utility assessed in a population-based cohort from Finland (n = 7,503; 176 deaths during 5 y of follow-up). Four circulating biomarkers predicted the risk of all-cause mortality among participants from the Estonian Biobank after adjusting for conventional risk factors: alpha-1-acid glycoprotein (hazard ratio [HR] 1.67 per 1–standard deviation increment, 95% CI 1.53–1.82, p = 5×10⁻³¹), albumin (HR 0.70, 95% CI 0.65–0.76, p = 2×10⁻¹⁸), very-low-density lipoprotein particle size (HR 0.69, 95% CI 0.62–0.77, p = 3×10⁻¹²), and citrate (HR 1.33, 95% CI 1.21–1.45, p = 5×10⁻¹⁰). All four biomarkers were predictive of cardiovascular mortality, as well as death from cancer and other nonvascular diseases. One in five participants in the Estonian Biobank cohort with a biomarker summary score within the highest percentile died during the first year of follow-up, indicating prominent systemic reflections of frailty. The biomarker associations all replicated in the Finnish validation cohort. Including the four biomarkers in a risk prediction score improved risk assessment for 5-y mortality (increase in C-statistics 0.031, p = 0.01; continuous reclassification improvement 26.3%, p = 0.001).
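The link between a reported hazard ratio, its 95% confidence interval and its p-value can be checked with a short back-calculation. This sketch assumes a Wald-type interval that is symmetric on the log scale, which is usual for proportional hazards models but is not stated in the abstract. The function name is hypothetical, and because the inputs are the rounded figures quoted above, the result is only approximate.

```python
import math


def wald_from_ci(hr, ci_low, ci_high, z_crit=1.959963984540054):
    """Recover the log-scale standard error, z statistic and two-sided
    p-value from a hazard ratio and its 95% confidence interval.

    Assumes the interval is exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided normal tail probability; erfc stays accurate for large |z|
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Rounded values quoted in the abstract for alpha-1-acid glycoprotein
se, z, p = wald_from_ci(1.67, 1.53, 1.82)
print(f"SE(log HR) = {se:.4f}, z = {z:.2f}, p = {p:.1e}")
```

With these rounded inputs the back-calculated p-value lands on the same order of magnitude as the one reported for alpha-1-acid glycoprotein. The same function works for protective associations such as albumin, where the z statistic is negative.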
Biomarker associations with cardiovascular, nonvascular, and cancer mortality suggest novel systemic connectivities across seemingly disparate morbidities. The biomarker profiling improved prediction of the short-term risk of death from all causes above established risk factors. Further investigations are needed to clarify the biological mechanisms and the utility of these biomarkers for guiding screening and prevention.
Editors' Summary
A biomarker is a biological molecule found in blood, body fluids, or tissues that may signal an abnormal process, a condition, or a disease. The level of a particular biomarker may indicate a patient's risk of disease, or likely response to a treatment. For example, cholesterol levels are measured to assess the risk of heart disease. Most current biomarkers are used to test an individual's risk of developing a specific condition. There are none that accurately assess whether a person is at risk of ill health generally, or likely to die soon from a disease. Early and accurate identification of people who appear healthy but in fact have an underlying serious illness would provide valuable opportunities for preventative treatment.
While most tests measure the levels of a specific biomarker, there are some technologies that allow blood samples to be screened for a wide range of biomarkers. These include nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry. These tools have the potential to be used to screen the general population for a range of different biomarkers.
Why Was This Study Done?
Identifying new biomarkers that provide insight into the risk of death from all causes could be an important step in linking different diseases and assessing patient risk. The authors in this study screened patient samples using NMR spectroscopy for biomarkers that accurately predict the risk of death particularly amongst the general population, rather than amongst people already known to be ill.
What Did the Researchers Do and Find?
The researchers studied two large groups of people, one in Estonia and one in Finland. Both countries have set up health registries that collect and store blood samples and health records over many years. The registries include large numbers of people who are representative of the wider population.
The researchers first tested blood samples from a representative subset of the Estonian group, testing 9,842 samples in total. They looked at 106 different biomarkers in each sample using NMR spectroscopy. They also looked at the health records of this group and found that 508 people died during the follow-up period after the blood sample was taken, the majority from heart disease, cancer, and other diseases. Using statistical analysis, they looked for any links between the levels of different biomarkers in the blood and people's short-term risk of dying. They found that the levels of four biomarkers—plasma albumin, alpha-1-acid glycoprotein, very-low-density lipoprotein (VLDL) particle size, and citrate—appeared to accurately predict short-term risk of death. They repeated this study with the Finnish group, this time with 7,503 individuals (176 of whom died during the five-year follow-up period after giving a blood sample) and found similar results.
The researchers carried out further statistical analyses to take into account other known factors that might have contributed to the risk of life-threatening illness. These included factors such as age, weight, tobacco and alcohol use, cholesterol levels, and pre-existing illness, such as diabetes and cancer. The association between the four biomarkers and short-term risk of death remained the same even when controlling for these other factors.
The analysis also showed that combining the test results for all four biomarkers, to produce a biomarker score, provided a more accurate measure of risk than any of the biomarkers individually. This biomarker score also proved to be the strongest predictor of short-term risk of dying in the Estonian group. Individuals with a biomarker score in the top 20% had a risk of dying within five years that was 19 times greater than that of individuals with a score in the bottom 20% (288 versus 15 deaths).
What Do These Findings Mean?
This study suggests that there are four biomarkers in the blood—alpha-1-acid glycoprotein, albumin, VLDL particle size, and citrate—that can be measured by NMR spectroscopy to assess whether otherwise healthy people are at short-term risk of dying from heart disease, cancer, and other illnesses. However, further validation of these findings is still required, and additional studies should examine the biomarker specificity and associations in settings closer to clinical practice. The combined biomarker score appears to be a more accurate predictor of risk than tests for more commonly known risk factors. Identifying individuals who are at high risk using these biomarkers might help to target preventative medical treatments to those with the greatest need.
However, there are several limitations to this study. As an observational study, it provides evidence of only a correlation between a biomarker score and ill health. It does not identify any underlying causes. Other factors, not detectable by NMR spectroscopy, might be the true cause of serious health problems and would provide a more accurate assessment of risk. Nor does this study identify what kinds of treatment might prove successful in reducing the risks. Therefore, more research is needed to determine whether testing for these biomarkers would provide any clinical benefit.
There were also some technical limitations to the study. NMR spectroscopy does not detect as many biomarkers as mass spectrometry, which might therefore identify further biomarkers for a more accurate risk assessment. In addition, because both study groups were northern European, it is not yet known whether the results would be the same in other ethnic groups or populations with different lifestyles.
In spite of these limitations, the fact that the same four biomarkers are associated with a short-term risk of death from a variety of diseases does suggest that similar underlying mechanisms are taking place. This observation points to some potentially valuable areas of research to understand precisely what's contributing to the increased risk.
Additional Information
Please access these websites via the online version of this summary at
The US National Institute of Environmental Health Sciences has information on biomarkers
The US Food and Drug Administration has a Biomarker Qualification Program to help researchers in identifying and evaluating new biomarkers
Further information on the Estonian Biobank is available
The Computational Medicine Research Team of the University of Oulu and the University of Bristol have a webpage that provides further information on high-throughput biomarker profiling by NMR spectroscopy
PMCID: PMC3934819  PMID: 24586121
6.  A deterministic map of Waddington's epigenetic landscape for cell fate specification 
BMC Systems Biology  2011;5:85.
The image of the "epigenetic landscape", with a series of branching valleys and ridges depicting stable cellular states and the barriers between those states, has been a popular visual metaphor for cell lineage specification - especially in light of the recent discovery that terminally differentiated adult cells can be reprogrammed into pluripotent stem cells or into alternative cell lineages. However the question of whether the epigenetic landscape can be mapped out quantitatively to provide a predictive model of cellular differentiation remains largely unanswered.
Here we derive a simple deterministic path-integral quasi-potential, based on the kinetic parameters of a gene network regulating cell fate, and show that this quantity is minimized along a temporal trajectory in the state space of the gene network, thus providing a marker of directionality for cell differentiation processes. We then use the derived quasi-potential as a measure of "elevation" to quantitatively map the epigenetic landscape, on which trajectories flow "downhill" from any location. Stochastic simulations confirm that the elevation of this computed landscape correlates to the likelihood of occurrence of particular cell fates, with well-populated low-lying "valleys" representing stable cellular states and higher "ridges" acting as barriers to transitions between the stable states.
This quantitative map of the epigenetic landscape underlying cell fate choice provides mechanistic insights into the "forces" that direct cellular differentiation in the context of physiological development, as well as during artificially induced cell lineage reprogramming. Our generalized approach to mapping the landscape is applicable to non-gradient gene regulatory systems for which an analytical potential function cannot be derived, and also to high-dimensional gene networks. Rigorous quantification of the gene regulatory circuits that govern cell lineage choice and subsequent mapping of the epigenetic landscape can potentially help identify optimal routes of cell fate reprogramming.
PMCID: PMC3213676  PMID: 21619617
7.  The Presentation of Dermatoglyphic Abnormalities in Schizophrenia: A Meta-Analytic Review 
Schizophrenia research  2012;142(1-3):1-11.
Within a neurodevelopmental model of schizophrenia, prenatal developmental deviations are implicated as early signs of increased risk for future illness. External markers of central nervous system maldevelopment may provide information regarding the nature and timing of prenatal disruptions among individuals with schizophrenia. One such marker is dermatoglyphic abnormalities (DAs) or unusual epidermal ridge patterns. Studies targeting DAs as a potential sign of early developmental disruption have yielded mixed results with regard to the strength of the association between DAs and schizophrenia. The current study aimed to resolve these inconsistencies by conducting a meta-analysis examining the six most commonly cited dermatoglyphic features among individuals with diagnoses of schizophrenia. Twenty-two studies published between 1968 and 2012 were included. Results indicated significant but small effects for total finger ridge count and total A-B ridge count, with lower counts among individuals with schizophrenia relative to controls. Other DAs examined in the current meta-analysis did not yield significant effects. Total finger ridge count and total A-B ridge count appear to yield the most reliable dermatoglyphic differences between individuals with and without schizophrenia.
PMCID: PMC3502669  PMID: 23116885
schizophrenia; dermatoglyphics; meta-analysis; neurodevelopment
8.  Best Linear Unbiased Prediction of Genomic Breeding Values Using a Trait-Specific Marker-Derived Relationship Matrix 
PLoS ONE  2010;5(9):e12648.
With the availability of high density whole-genome single nucleotide polymorphism chips, genomic selection has become a promising method to estimate genetic merit with potentially high accuracy for animal, plant and aquaculture species of economic importance. With markers covering the entire genome, genetic merit of genotyped individuals can be predicted directly within the framework of mixed model equations, by using a matrix of relationships among individuals that is derived from the markers. Here we extend that approach by deriving a marker-based relationship matrix specifically for the trait of interest.
Methodology/Principal Findings
In the framework of mixed model equations, a new best linear unbiased prediction (BLUP) method including a trait-specific relationship matrix (TA) was presented and termed TABLUP. The TA matrix was constructed on the basis of marker genotypes and their weights in relation to the trait of interest. A simulation study with 1,000 individuals as the training population and five successive generations as candidate population was carried out to validate the proposed method. The proposed TABLUP method outperformed the ridge regression BLUP (RRBLUP) and BLUP with realized relationship matrix (GBLUP). It performed slightly worse than BayesB with an accuracy of 0.79 in the standard scenario.
The proposed TABLUP method is an improvement over the RRBLUP and GBLUP methods. It might be equivalent to the BayesB method, but it has additional benefits such as the calculation of accuracies for individual breeding values. The results also showed that the TA matrix has better predictive ability than the classical numerator relationship matrix and the realized relationship matrix, which are derived solely from pedigree or markers without regard to the trait. This is because the TA matrix not only accounts for the Mendelian sampling term, but also puts greater emphasis on those markers that explain more of the genetic variance in the trait.
PMCID: PMC2936569  PMID: 20844593
9.  Predicting the survival time for diffuse large B-cell lymphoma using microarray data 
The present study was conducted to predict survival time in patients with diffuse large B-cell lymphoma (DLBCL) from microarray data, using the Cox regression model combined with seven dimension reduction methods. This historical cohort included 2042 gene expression measurements from 40 patients with DLBCL. To predict survival, the Cox regression model was combined with seven methods for dimension reduction or shrinkage: univariate selection, forward stepwise selection, principal component regression, supervised principal component regression, partial least squares regression, ridge regression and lasso. Predictive capacity was examined by three criteria: the log-rank test, the prognostic index and the deviance. MATLAB R2008a and RKWard software were used for data analysis. Ridge regression performed better than the other methods. Based on the ridge regression coefficients and a given cut point value, 16 genes were selected. Forward stepwise selection in the Cox regression model indicated that the expression of genes GENE3555X and GENE3807X decreased survival time (P=0.008 and P=0.003, respectively), whereas the genes GENE3228X and GENE1551X increased survival time (P=0.002 and P<0.001, respectively). This study indicated that the ridge regression method had a higher capacity than the other dimension reduction methods for predicting survival time in patients with DLBCL. Furthermore, a combination of statistical methods and microarray data could help to detect genes that influence survival.
PMCID: PMC3410377  PMID: 23173013
Lymphoma; gene expression; microarray; survival analysis; dimension reduction; ridge regression
10.  Heteroscedastic Ridge Regression Approaches for Genome-Wide Prediction With a Focus on Computational Efficiency and Accurate Effect Estimation 
G3: Genes|Genomes|Genetics  2014;4(3):539-546.
Ridge regression with heteroscedastic marker variances provides an alternative to Bayesian genome-wide prediction methods. Our objectives were to suggest new methods to determine marker-specific shrinkage factors for heteroscedastic ridge regression and to investigate their properties with respect to computational efficiency and accuracy of estimated effects. We analyzed published data sets of maize, wheat, and sugar beet as well as simulated data with the new methods. Ridge regression with shrinkage factors that were proportional to single-marker analysis of variance estimates of variance components (i.e., RRWA) was the fastest method. It required computation times of less than 1 sec for medium-sized data sets, which have dimensions that are common in plant breeding. A modification of the expectation-maximization algorithm that yields heteroscedastic marker variances (i.e., RMLV) resulted in the most accurate marker effect estimates. It outperformed the homoscedastic ridge regression approach for best linear unbiased prediction in particular for situations with high marker density and strong linkage disequilibrium along the chromosomes, a situation that occurs often in plant breeding populations. We conclude that the RRWA and RMLV approaches provide alternatives to the commonly used Bayesian methods, in particular for applications in which computational feasibility or accuracy of effect estimates are important, such as detection or functional analysis of genes or planning crosses.
PMCID: PMC3962491  PMID: 24449687
genome-wide prediction; ridge regression; heteroscedastic marker variances; linkage disequilibrium; plant breeding populations; GenPred; Shared data resources
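As a rough illustration of the idea in the abstract above, ridge regression with marker-specific shrinkage can be sketched as solving (X'X + D) b = X'y, where D is a diagonal matrix of per-marker penalties. This is a minimal sketch only: the function name `heteroscedastic_ridge`, the simulated data and the hand-picked penalty values are assumptions made for illustration, and the code does not implement the RRWA or RMLV procedures for choosing the shrinkage factors.

```python
import numpy as np


def heteroscedastic_ridge(X, y, penalties):
    """Solve (X'X + diag(penalties)) b = X'y for the marker effects b.

    `penalties` holds one non-negative shrinkage factor per marker;
    equal values give ordinary (homoscedastic) ridge regression.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    penalties = np.asarray(penalties, dtype=float)
    A = X.T @ X + np.diag(penalties)
    return np.linalg.solve(A, X.T @ y)


# Hypothetical simulated data: 50 individuals, 200 markers, 5 true effects.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:5] = 2.0
y = X @ true_beta + 0.1 * rng.standard_normal(n)

# Homoscedastic ridge: the same penalty for every marker.
homo = heteroscedastic_ridge(X, y, np.full(p, 10.0))

# Heteroscedastic ridge: weaker shrinkage on the first five markers.
# These penalty values are chosen by hand purely for illustration.
penalties = np.full(p, 10.0)
penalties[:5] = 0.1
hetero = heteroscedastic_ridge(X, y, penalties)
```

In practice the per-marker penalties would be estimated from the data, which is the step the RRWA and RMLV approaches address.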
11.  Survival prediction from clinico-genomic models - a comparative study 
BMC Bioinformatics  2009;10:413.
Survival prediction from high-dimensional genomic data is an active field in today's medical research. Most of the proposed prediction methods make use of genomic data alone without considering established clinical covariates that often are available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions, but there is a lack of systematic studies on the topic. Also, for the widely used Cox regression model, it is not obvious how to handle such combined models.
We propose a way to combine classical clinical covariates with genomic data in a clinico-genomic prediction model based on the Cox regression model. The prediction model is obtained by a simultaneous use of both types of covariates, but applying dimension reduction only to the high-dimensional genomic variables. We describe how this can be done for seven well-known prediction methods: variable selection, unsupervised and supervised principal components regression and partial least squares regression, ridge regression, and the lasso. We further perform a systematic comparison of the performance of prediction models using clinical covariates only, genomic data only, or a combination of the two. The comparison is done using three survival data sets containing both clinical information and microarray gene expression data. Matlab code for the clinico-genomic prediction methods is available at
Based on our three data sets, the comparison shows that established clinical covariates will often lead to better predictions than what can be obtained from genomic data alone. In the cases where the genomic models are better than the clinical, ridge regression is used for dimension reduction. We also find that the clinico-genomic models tend to outperform the models based on only genomic data. Further, clinico-genomic models and the use of ridge regression give, for all three data sets, better predictions than models based on the clinical covariates alone.
PMCID: PMC2811121  PMID: 20003386
12.  Patterns of Obesity Development before the Diagnosis of Type 2 Diabetes: The Whitehall II Cohort Study 
PLoS Medicine  2014;11(2):e1001602.
Examining patterns of change in body mass index (BMI) and other cardiometabolic risk factors in individuals during the years before they were diagnosed with diabetes, Kristine Færch and colleagues report that few of them experienced dramatic BMI changes.
Please see later in the article for the Editors' Summary
Patients with type 2 diabetes vary greatly with respect to degree of obesity at time of diagnosis. To address the heterogeneity of type 2 diabetes, we characterised patterns of change in body mass index (BMI) and other cardiometabolic risk factors before type 2 diabetes diagnosis.
Methods and Findings
We studied 6,705 participants from the Whitehall II study, an observational prospective cohort study of civil servants based in London. White men and women, initially free of diabetes, were followed with 5-yearly clinical examinations from 1991–2009 for a median of 14.1 years (interquartile range [IQR]: 8.7–16.2 years). Type 2 diabetes developed in 645 (1,209 person-examinations) and 6,060 remained free of diabetes during follow-up (14,060 person-examinations). Latent class trajectory analysis of incident diabetes cases was used to identify patterns of pre-disease BMI. Associated trajectories of cardiometabolic risk factors were studied using adjusted mixed-effects models. Three patterns of BMI changes were identified. Most participants belonged to the “stable overweight” group (n = 604, 94%) with a relatively constant BMI level within the overweight category throughout follow-up. They experienced slight worsening of beta cell function and insulin sensitivity from 5 years prior to diagnosis. A small group of “progressive weight gainers” (n = 15) exhibited a pattern of consistent weight gain before diagnosis. Linear increases in blood pressure and an exponential increase in insulin resistance a few years before diagnosis accompanied the weight gain. The “persistently obese” (n = 26) were severely obese throughout the whole 18 years before diabetes diagnosis. They experienced an initial beta cell compensation followed by loss of beta cell function, whereas insulin sensitivity was relatively stable. Since the generalizability of these findings is limited, the results need confirmation in other study populations.
Three patterns of obesity changes prior to diabetes diagnosis were accompanied by distinct trajectories of insulin resistance and other cardiometabolic risk factors in a white, British population. While these results should be verified independently, the great majority of patients had modest weight gain prior to diagnosis. These results suggest that strategies focusing on small weight reductions for the entire population may be more beneficial than predominantly focusing on weight loss for high-risk individuals.
Editors' Summary
Worldwide, more than 350 million people have diabetes, a metabolic disorder characterized by high amounts of glucose (sugar) in the blood. Blood sugar levels are normally controlled by insulin, a hormone released by the pancreas after meals (digestion of food produces glucose). In people with type 2 diabetes (the commonest form of diabetes) blood sugar control fails because the fat and muscle cells that normally respond to insulin by removing sugar from the blood become insulin resistant. Type 2 diabetes, which was previously called adult-onset diabetes, can be controlled with diet and exercise, and with drugs that help the pancreas make more insulin or that make cells more sensitive to insulin. Long-term complications, which include an increased risk of heart disease and stroke, reduce the life expectancy of people with diabetes by about 10 years compared to people without diabetes. The number of people with diabetes is expected to increase dramatically over the next decades, coinciding with rising obesity rates in many countries. To better understand diabetes development, to identify people at risk, and to find ways to prevent the disease are urgent public health goals.
Why Was This Study Done?
It is known that people who are overweight or obese have a higher risk of developing diabetes. Because of this association, a common assumption is that people who experienced recent weight gain are more likely to be diagnosed with diabetes. In this prospective cohort study (an investigation that records the baseline characteristics of a group of people and then follows them to see who develops specific conditions), the researchers tested the hypothesis that substantial weight gain precedes a diagnosis of diabetes and explored more generally the patterns of body weight and composition in the years before people develop diabetes. They then examined whether changes in body weight corresponded with changes in other risk factors for diabetes (such as insulin resistance), lipid profiles and blood pressure.
What Did the Researchers Do and Find?
The researchers studied participants from the Whitehall II study, a prospective cohort study initiated in 1985 to investigate the socioeconomic inequalities in disease. Whitehall II enrolled more than 10,000 London-based government employees. Participants underwent regular health checks during which their weight and height were measured, blood tests were done, and they filled out questionnaires for other relevant information. From 1991 onwards, participants were tested every five years for diabetes. The 6,705 participants included in this study were initially free of diabetes, and most of them were followed for at least 14 years. During the follow-up, 645 participants developed diabetes, while 6,060 remained free of the disease.
The researchers used a statistical tool called “latent class trajectory analysis” to study patterns of changes in body mass index (BMI) in the years before people developed diabetes. BMI is a measure of human obesity based on a person's weight and height. Latent class trajectory analysis is an unbiased way to subdivide a number of people into groups that differ based on specified parameters. In this case, the researchers wanted to identify several groups among all the people who eventually developed diabetes each with a distinct pattern of BMI development. Having identified such groups, they also examined how a variety of tests associated with diabetes risk, and risks for heart disease and stroke changed in the identified groups over time.
They identified three different patterns of BMI changes in the 645 participants who developed diabetes. The vast majority (604 individuals, or 94%) belonged to a group they called “stable-overweight.” These people showed no dramatic change in their BMI in the years before they were diagnosed. They were overweight when they first entered the study and gained or lost little weight during the follow-up years. They showed only minor signs of insulin resistance, starting five years before they developed diabetes. A second, much smaller group of 15 people gained weight consistently in the years before diagnosis. As they were gaining weight, these people also had rises in blood pressure and substantial gains in insulin resistance. The 26 remaining participants who formed the third group were persistently obese for the entire time they participated in the study, in some cases up to 18 years before they were diagnosed with diabetes. They had some signs of insulin resistance in the years before diagnosis, but not the substantial gain often seen as the hallmark of “pre-diabetes.”
What Do These Findings Mean?
These results suggest that diabetes development is a complicated process, and one that differs between individuals who end up with the disease. They call into question the common notion that most people who develop diabetes have recently gained a lot of weight or are obese. A substantial rise in insulin resistance, another established risk factor for diabetes, was only seen in the smallest of the groups, namely the people who gained weight consistently for years before they were diagnosed. When the scientists applied a commonly used predictor of diabetes called the “Framingham diabetes risk score” to their largest “stably overweight” group, they found that these people were not classified as having a particularly high risk, and that their risk scores actually declined in the last five years before their diabetes diagnosis. This suggests that predicting diabetes in this group might be difficult.
The researchers applied their methodology only to this one cohort of white civil servants in England. Before drawing more firm conclusions on the process of diabetes development, it will be important to test whether similar results are seen in other cohorts and among more diverse individuals. If the three groups identified here are found in other cohorts, another question is whether they are as unequal in size as in this example. And if they are, can the large group of stably overweight people be further subdivided in ways that suggest specific mechanisms of disease development? Even without knowing how generalizable the provocative findings of this study are, they should stimulate debate on how to identify people at risk for diabetes and how to prevent the disease or delay its onset.
Additional Information
Please access these Web sites via the online version of this summary at
The US National Diabetes Information Clearinghouse provides information about diabetes for patients, health-care professionals, and the general public, including information on diabetes prevention (in English and Spanish)
The UK National Health Service Choices website provides information for patients and carers about type 2 diabetes; it includes people's stories about diabetes
The charity Diabetes UK also provides detailed information about diabetes for patients and carers, including information on healthy lifestyles for people with diabetes, and has a further selection of stories from people with diabetes; the charity Healthtalkonline has interviews with people about their experiences of diabetes
MedlinePlus provides links to further resources and advice about diabetes (in English and Spanish)
More information about the Whitehall II study is available
PMCID: PMC3921118  PMID: 24523667
13.  Urbanicity and Lifestyle Risk Factors for Cardiometabolic Diseases in Rural Uganda: A Cross-Sectional Study 
PLoS Medicine  2014;11(7):e1001683.
Johanna Riha and colleagues evaluate the association of lifestyle risk factors with elements of urbanicity, such as having a public telephone, a primary school, or a hospital, among individuals living in rural settings in Uganda.
Please see later in the article for the Editors' Summary
Urban living is associated with unhealthy lifestyles that can increase the risk of cardiometabolic diseases. In sub-Saharan Africa (SSA), where the majority of people live in rural areas, it is still unclear if there is a corresponding increase in unhealthy lifestyles as rural areas adopt urban characteristics. This study examines the distribution of urban characteristics across rural communities in Uganda and their associations with lifestyle risk factors for chronic diseases.
Methods and Findings
Using data collected in 2011, we examined cross-sectional associations between urbanicity and lifestyle risk factors in rural communities in Uganda, with 7,340 participants aged 13 y and above across 25 villages. Urbanicity was defined according to a multi-component scale, and Poisson regression models were used to examine associations between urbanicity and lifestyle risk factors by quartile of urbanicity. Despite all of the villages not having paved roads and running water, there was marked variation in levels of urbanicity across the villages, largely attributable to differences in economic activity, civil infrastructure, and availability of educational and healthcare services. In regression models, after adjustment for clustering and potential confounders including socioeconomic status, increasing urbanicity was associated with an increase in lifestyle risk factors such as physical inactivity (risk ratio [RR]: 1.19; 95% CI: 1.14, 1.24), low fruit and vegetable consumption (RR: 1.17; 95% CI: 1.10, 1.23), and high body mass index (RR: 1.48; 95% CI: 1.24, 1.77).
This study indicates that even across rural communities in SSA, increasing urbanicity is associated with a higher prevalence of lifestyle risk factors for cardiometabolic diseases. This finding highlights the need to consider the health impact of urbanization in rural areas across SSA.
Editors’ Summary
Cardiometabolic diseases—cardiovascular diseases that affect the heart and/or the blood vessels and metabolic diseases that affect the cellular chemical reactions needed to sustain life—are a growing global health concern. In sub-Saharan Africa, for example, the prevalence (the proportion of a population that has a given disease) of adults with diabetes (a life-shortening metabolic disease that affects how the body handles sugars) is currently 3.8%. By 2030, it is estimated that the prevalence of diabetes among adults in this region will have risen to 4.6%. Similarly, in 2004, around 1.2 million deaths in sub-Saharan Africa were attributed to coronary heart disease, heart failure, stroke, and other cardiovascular diseases. By 2030, the number of deaths in this region attributable to cardiovascular disease is expected to double. Globally, cardiovascular disease and diabetes are now responsible for around 17.3 million and 1.3 million annual deaths, respectively, together accounting for about one-third of all deaths.
Why Was This Study Done?
Experts believe that increased consumption of saturated fats, sugar, and salt and reduced physical activity are partly responsible for the increasing global prevalence of cardiometabolic diseases. These lifestyle changes, they suggest, are related to urbanization—urban expansion into the countryside and migration from rural to urban areas. If this is true, the prevalence of unhealthy lifestyles should increase as rural areas adopt urban characteristics. Sub-Saharan Africa is the least urbanized region in the world, with about 60% of the population living in rural areas. However, rural settlements across the subcontinent are increasingly adopting urban characteristics. It is important to know whether urbanization is affecting the health of rural residents in sub-Saharan Africa to improve estimates of the future burden of cardiometabolic diseases in the region and to provide insights into ways to limit this burden. In this cross-sectional study (an investigation that studies participants at a single time point), the researchers examine the distribution of urban characteristics across rural communities in Uganda and the association of these characteristics with lifestyle risk factors for cardiometabolic diseases.
What Did the Researchers Do and Find?
For their study, the researchers used data collected in 2011 by the General Population Cohort study, a study initiated in 1989 to describe HIV infection trends among people living in 25 villages in rural southwestern Uganda that collects health-related and other information annually from its participants. The researchers quantified the “urbanicity” of the 25 villages using a multi-component scale that included information such as village size and economic activity. They then used statistical models to examine associations between urbanicity and lifestyle risk factors such as body mass index (BMI, a measure of obesity) and self-reported fruit and vegetable consumption for more than 7,000 study participants living in those villages. None of the villages had paved roads or running water. However, urbanicity varied markedly across the villages, largely because of differences in economic activity, civil infrastructure, and the availability of educational and healthcare services. Notably, increasing urbanicity was associated with an increase in lifestyle risk factors for cardiovascular diseases. So, for example, people living in villages with the highest urbanicity scores were nearly 20% more likely to be physically inactive and to eat less fruits and vegetables and nearly 50% more likely to have a high BMI than people living in villages with the lowest urbanicity scores.
What Do These Findings Mean?
These findings indicate that, across rural communities in Uganda, even a small increase in urbanicity is associated with a higher prevalence of potentially modifiable lifestyle risk factors for cardiometabolic diseases. These findings suggest, therefore, that simply classifying settlements as either rural or urban may not be adequate to capture the information needed to target strategies for cardiometabolic disease management and control in rural areas as they become more urbanized. Because this study was cross-sectional, it is not possible to say how long a rural population needs to experience a more urban environment before its risk of cardiometabolic diseases increases. Longitudinal studies are needed to obtain this information. Moreover, studies of other countries in sub-Saharan Africa are needed to show that these findings are generalizable across the region. However, based on these findings, and given that more than 553 million people live in rural areas across sub-Saharan Africa, it seems likely that increasing urbanization will have a substantial impact on the future health of populations throughout sub-Saharan Africa.
Additional Information
Please access these websites via the online version of this summary at
This study is further discussed in a PLOS Medicine Perspective by Fahad Razak and Lisa Berkman
The American Heart Association provides information on all aspects of cardiovascular disease and diabetes; its website includes personal stories about heart attacks, stroke, and diabetes
The US Centers for Disease Control and Prevention has information on heart disease, stroke, and diabetes (in English and Spanish)
The UK National Health Service Choices website provides information about cardiovascular disease and diabetes (including some personal stories)
The World Health Organization’s Global Noncommunicable Disease Network (NCDnet) aims to help low- and middle-income countries reduce illness and death caused by cardiometabolic and other non-communicable diseases
The World Heart Federation has recently produced a report entitled “Urbanization and Cardiovascular Disease”
Wikipedia has a page on urbanization (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
PMCID: PMC4114555  PMID: 25072243
14.  Pesticide Leaching from Agricultural Fields with Ridges and Furrows 
Water, Air, and Soil Pollution  2010;213(1-4):341-352.
In the evaluation of the risk of pesticide leaching to groundwater, the soil surface is usually assumed to be level, although important crops like potato are grown on ridges. A fraction of the water from rainfall and sprinkler irrigation may flow along the soil surface from the ridges to the furrows, thus bringing about an extra load of water and pesticide on the furrow soil. A survey of the literature reveals that surface-runoff from ridges to furrows is a well-known phenomenon but that hardly any data are available on the quantities of water and pesticide involved. On the basis of a field experiment with additional sprinkler irrigation, computer simulations were carried out with the Pesticide Emission Assessment at Regional and Local scales model for separate ridge and furrow systems in a humic sandy potato field. Breakthrough curves of bromide ion (as a tracer for water flow) and carbofuran (as example pesticide) were calculated for 1-m depth in the field. Bromide ion leached comparatively fast from the furrow system, while leaching from the ridge system was slower showing a maximum concentration of about half of that for the furrow system. Carbofuran breakthrough from the furrow system began about a month after application and increased steadily to substantial concentrations. Because the transport time of carbofuran in the ridge soil was much longer, no breakthrough occurred in the growing season. The maximum concentration of carbofuran leaching from the ridge–furrow field was computed to be a factor of six times as high as that computed for the corresponding level field. The study shows that the risk of leaching of pesticides via the furrow soil can be substantially higher than that via the corresponding level field soil.
PMCID: PMC2956044  PMID: 21076668
Bromide ion; Carbofuran; Computer model; Groundwater; Insecticide; Potato crop; Sandy soils; Simulation model
15.  Comparison of statistical procedures for estimating polygenic effects using dense genome-wide marker data 
BMC Proceedings  2009;3(Suppl 1):S12.
In this study we compared different statistical procedures for estimating SNP effects using the simulated data set from the XII QTL-MAS workshop. Five procedures were considered and tested in a reference population, i.e., the first four generations, from which phenotypes and genotypes were available. The procedures can be interpreted as variants of ridge regression, with different ways for defining the shrinkage parameter. Comparisons were made with respect to the correlation between genomic and conventional estimated breeding values. Moderate correlations were obtained from all methods. Two of them were used to predict genomic breeding values in the last three generations. Correlations between these and the true breeding values were also moderate. We concluded that the ridge regression procedures applied in this study did not outperform the simple use of a ratio of variances in a mixed model method, both providing moderate accuracies of predicted genomic breeding values.
PMCID: PMC2654493  PMID: 19278538
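To make the link between ridge regression and the mixed-model "ratio of variances" concrete, the sketch below sets the shrinkage parameter to lambda = sigma2_e / sigma2_u (residual variance over per-marker effect variance) and checks that the marker-level solution equals the equivalent form that only requires solving an n x n system. The function names, the simulated genotypes and the variance values are assumptions for illustration; the five procedures compared in the study are not reproduced here, and the phenotype is assumed to be centred already.

```python
import numpy as np


def ridge_from_variances(Z, y, sigma2_e, sigma2_u):
    """Marker effects from (Z'Z + lambda I) u = Z'y, lambda = sigma2_e / sigma2_u."""
    lam = sigma2_e / sigma2_u
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)


def ridge_dual(Z, y, sigma2_e, sigma2_u):
    """Same estimates computed as Z'(ZZ' + lambda I)^{-1} y (an n x n solve)."""
    lam = sigma2_e / sigma2_u
    n = Z.shape[0]
    return Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y)


# Hypothetical data: 30 individuals, 100 markers coded 0/1/2 and centred.
rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(30, 100)).astype(float)
Z -= Z.mean(axis=0)
y = rng.standard_normal(30)
y -= y.mean()

u_primal = ridge_from_variances(Z, y, sigma2_e=1.0, sigma2_u=0.05)
u_dual = ridge_dual(Z, y, sigma2_e=1.0, sigma2_u=0.05)
gebv = Z @ u_primal  # predicted genomic breeding values for these individuals
```

The n x n form is the cheaper one when markers greatly outnumber individuals, as is typical for dense genome-wide marker data.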
16.  Regularized estimation of large-scale gene association networks using graphical Gaussian models 
BMC Bioinformatics  2009;10:384.
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression based approaches when using subsampling.
PMCID: PMC2808166  PMID: 19930695
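The role of regularization when variables outnumber samples can be sketched with a simple ridge-type estimate: add lambda to the diagonal of the sample covariance matrix S, invert, and scale the result into partial correlations, pcor_ij = -omega_ij / sqrt(omega_ii * omega_jj). This is a minimal sketch assuming a fixed, hand-chosen lambda; the function name `partial_correlations` is hypothetical, and the code does not use or mimic the API of the "parcor" package or any of the regression-based estimators compared in the article.

```python
import numpy as np


def partial_correlations(data, lam=0.1):
    """Partial correlations from a ridge-regularized inverse covariance.

    data: array of shape (n_samples, n_variables)
    lam:  value added to the diagonal so the matrix is invertible
          even when n_samples < n_variables
    """
    data = np.asarray(data, dtype=float)
    S = np.cov(data, rowvar=False)
    p = S.shape[0]
    omega = np.linalg.inv(S + lam * np.eye(p))
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Hypothetical expression data: 20 samples, 50 genes (more genes than samples).
rng = np.random.default_rng(2)
expr = rng.standard_normal((20, 50))
pcor = partial_correlations(expr, lam=0.5)
```

Deciding which entries of `pcor` correspond to edges in the network would still require a testing or thresholding step, such as the false discovery rate procedure mentioned in the abstract.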
17.  Linkage Analysis of a Model Quantitative Trait in Humans: Finger Ridge Count Shows Significant Multivariate Linkage to 5q14.1 
PLoS Genetics  2007;3(9):e165.
The finger ridge count (a measure of pattern size) is one of the most heritable complex traits studied in humans and has been considered a model human polygenic trait in quantitative genetic analysis. Here, we report the results of the first genome-wide linkage scan for finger ridge count in a sample of 2,114 offspring from 922 nuclear families. Both univariate linkage to the absolute ridge count (a sum of all the ridge counts on all ten fingers), and multivariate linkage analyses of the counts on individual fingers, were conducted. The multivariate analyses yielded significant linkage to 5q14.1 (Logarithm of odds [LOD] = 3.34, pointwise-empirical p-value = 0.00025) that was predominantly driven by linkage to the ring, index, and middle fingers. The strongest univariate linkage was to 1q42.2 (LOD = 2.04, point-wise p-value = 0.002, genome-wide p-value = 0.29). In summary, the combination of univariate and multivariate results was more informative than simple univariate analyses alone. Patterns of quantitative trait loci factor loadings consistent with developmental fields were observed, and the simple pleiotropic model underlying the absolute ridge count was not sufficient to characterize the interrelationships between the ridge counts of individual fingers.
Author Summary
Finger ridge count (an index of the size of the fingerprint pattern) has been used as a model trait for the study of human quantitative genetics for over 80 years. Here, we present the first genome-wide linkage scan for finger ridge count in a large sample of 2,114 offspring from 922 nuclear families. Our results illustrate the increase in power and information that can be gained from a multivariate linkage analysis of ridge counts of individual fingers as compared to a univariate analysis of a summary measure (absolute ridge count). The strongest evidence for linkage was seen at 5q14.1, and the pattern of loadings was consistent with a developmental field factor whose influence is greatest on the ring finger, falling off to either side, which is consistent with previous findings that heritability for ridge count is higher for the middle three fingers. We feel that the paper will be of specific methodological interest to those conducting linkage and association analyses with summary measures. In addition, given the frequency with which this phenotype is used as a didactic example in genetics courses we feel that this paper will be of interest to the general scientific community.
PMCID: PMC1994711  PMID: 17907812
18.  Genome-wide selection by mixed model ridge regression and extensions based on geostatistical models 
BMC Proceedings  2010;4(Suppl 1):S8.
The success of genome-wide selection (GS) approaches will depend crucially on the availability of efficient and easy-to-use computational tools. Therefore, approaches that can be implemented using mixed models hold particular promise and deserve detailed study. A particular class of mixed models suitable for GS is given by geostatistical mixed models, when genetic distance is treated analogously to spatial distance in geostatistics.
We consider various spatial mixed models for use in GS. The analyses presented for the QTL-MAS 2009 dataset pay particular attention to the modelling of residual errors as well as of polygenetic effects.
It is shown that geostatistical models are viable alternatives to ridge regression, one of the common approaches to GS. Correlations between genome-wide estimated breeding values and true breeding values were between 0.879 and 0.889. In the example considered, we did not find a large effect of the residual error variance modelling, largely because error variances were very small. A variance components model reflecting the pedigree of the crosses did not provide an improved fit.
We conclude that geostatistical models deserve further study as a tool for GS that is easily implemented in a mixed model package.
PMCID: PMC2857850  PMID: 20380762
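As a rough illustration of the ridge regression baseline that entry 18 compares against, the sketch below estimates marker effects by solving (X'X + λI)β = X'y and sums them into a breeding value for a new individual. It is a minimal pure-Python example, not the mixed-model implementation used in the paper. The function names (`solve`, `ridge_fit`, `predict_gebv`), the 0/1/2 genotype coding, the toy data and the penalty value are all assumptions made for illustration, and no intercept or centring is included.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_fit(x, y, lam):
    """Return ridge estimates of marker effects: (X'X + lam*I)^-1 X'y."""
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


def predict_gebv(effects, genotype):
    """Sum marker effects weighted by one individual's genotype codes."""
    return sum(b * g for b, g in zip(effects, genotype))


# Hypothetical toy data: 2 phenotyped individuals, 3 markers (p > n).
genotypes = [[0, 1, 2], [2, 1, 0]]
phenotypes = [1.0, 3.0]
effects = ridge_fit(genotypes, phenotypes, lam=1.0)
gebv = predict_gebv(effects, [1, 1, 1])
```

With more markers than individuals, X'X alone is singular, so the λ term is what makes the system solvable. In practice λ would be chosen by cross-validation or from variance components rather than fixed by hand.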
19.  Longitudinal deformation models, spatial regularizations and learning strategies to quantify Alzheimer's disease progression 
NeuroImage : Clinical  2014;4:718-729.
In the context of Alzheimer's disease, two challenging issues are (1) the characterization of local hippocampal shape changes specific to disease progression and (2) the identification of mild-cognitive impairment patients likely to convert. In the literature, (1) is usually solved first to detect areas potentially related to the disease. These areas are then considered as an input to solve (2). As an alternative to this sequential strategy, we investigate the use of a classification model using logistic regression to address both issues (1) and (2) simultaneously. The classification of the patients therefore does not require any a priori definition of the most representative hippocampal areas potentially related to the disease, as they are automatically detected. We first quantify deformations of patients' hippocampi between two time points using the large deformations by diffeomorphisms framework and transport these deformations to a common template. Since the deformations are expected to be spatially structured, we perform classification combining logistic loss and spatial regularization techniques, which have not been explored so far in this context, as far as we know. The main contribution of this paper is the comparison of regularization techniques enforcing the coefficient maps to be spatially smooth (Sobolev), piecewise constant (total variation) or sparse (fused LASSO) with standard regularization techniques which do not take into account the spatial structure (LASSO, ridge and ElasticNet). On a dataset of 103 patients out of ADNI, the techniques using spatial regularizations lead to the best classification rates. They also find coherent areas related to the disease progression.
• Study of deformation models for longitudinal analysis
• New framework combining LDDMM, logistic regression and spatial regularizations
• Simultaneous disease progression classification and biomarker identification
• Validation in the context of Alzheimer's disease on a large dataset from ADNI
PMCID: PMC4053641  PMID: 24936423
Alzheimer's disease; Brain imaging; Deformation model; LDDMM; Disease progression; Karcher mean; Transport; Logistic regression; Spatial regularization; Coefficient map
20.  Pre-selection of markers for genomic selection 
BMC Proceedings  2011;5(Suppl 3):S12.
Accurate prediction of genomic breeding values (GEBVs) requires numerous markers. However, predictive accuracy can be enhanced by excluding markers with no effects or with inconsistent effects among crosses that can adversely affect the prediction of GEBVs.
We present three different approaches for pre-selecting markers prior to predicting GEBVs using four different BLUP methods, including ridge regression and three spatial models. Performances of the models were evaluated using 5-fold cross-validation.
Results and conclusions
Ridge regression and the spatial models gave essentially similar fits. Pre-selecting markers was evidently beneficial since excluding markers with inconsistent effects among crosses increased the correlation between GEBVs and true breeding values of the non-phenotyped individuals from 0.607 (using all markers) to 0.625 (using pre-selected markers). Moreover, extension of the ridge regression model to allow for heterogeneous variances between the most significant subset and the complementary subset of pre-selected markers increased predictive accuracy (from 0.625 to 0.648) for the simulated dataset for the QTL-MAS 2010 workshop.
PMCID: PMC3103197  PMID: 21624168
21.  Cardiometabolic Risk among African-American Women: A Pilot Study 
To determine the associations of the Homeostatic Model of Assessment-insulin resistance (HOMA-ir), acanthosis nigricans, high sensitivity C-reactive protein (hs-CRP), and plasminogen activator inhibitor-1 (PAI-1) with two of the commonly used definitions of the metabolic syndrome (Adult Treatment Panel III {ATP III} and International Diabetes Federation {IDF}) among reproductive age healthy free living African-American women.
A pilot study with a cross-sectional design examined 33 African-American women aged 20 to 46 (mean 31.24 ± 7.25) for the presence of metabolic syndrome determined by ATP III and IDF criteria, insulin resistance (HOMA-ir and/or acanthosis nigricans), degree of inflammation (hs-CRP) and presence of dysfibrinolysis (PAI-1).
HOMA-ir identified insulin resistance in 27 (81.8%) of the women, whereas the presence of acanthosis nigricans indicated that 16 (48%) of these women manifested insulin resistance. Metabolic syndrome was found in 7 women (21.2%) by ATP III criteria and in 9 (27.3%) by IDF criteria. Bivariate correlations showed associations between HOMA-ir and waist circumference, body mass index (BMI), acanthosis nigricans, and the ATP III and IDF definitions for metabolic syndrome. PAI-1 was significantly correlated with waist circumference, BMI, fasting glucose, HOMA-ir, and ATP III. Both HOMA-ir and PAI-1 were significantly and negatively correlated with HDL-C. hs-CRP was significantly correlated with BMI and 2-hour post glucose.
When individually regressed on the ATP III definition of metabolic syndrome, dysfibrinolysis (PAI-1 levels) and insulin resistance (HOMA-ir) explained 32% and 29% of the variance, respectively. The addition of HOMA-ir measurement may significantly improve early recognition of cardiometabolic risk among reproductive age African-American women who have not yet met the criteria for the ATP III or IDF definitions of the metabolic syndrome. Likewise, acanthosis nigricans is potentially a clinically significant screening tool for early recognition of insulin resistance and/or cardiometabolic risk among this population.
African-American women's risk for CVD is likely underestimated based on the sole use of ATP III criteria for diagnosis of metabolic syndrome. Clinicians should consider a broader definition of risk than that contained within ATP III. Inclusion of biomarkers of inflammation and dysfibrinolysis along with measures of insulin resistance may add to early detection of cardiometabolic risk, and ultimate reduction in cardiovascular health disparities among African-American women.
PMCID: PMC3204876  PMID: 19242280
metabolic syndrome; insulin resistance; inflammation; dysfibrinolysis; Plasminogen Activator Inhibitor-1 (PAI-1); high sensitivity C-reactive protein (hs-CRP); Homeostatic Model of Assessment-insulin resistance (HOMA-ir); health disparity
22.  Nonlinear, Multilevel Mixed-Effects Approach for Modeling Longitudinal Standard Automated Perimetry Data in Glaucoma 
Ordinary least squares linear regression (OLSLR) analyses are inappropriate for performing trend analysis on repeatedly measured longitudinal data. This study examines multilevel linear mixed-effects (LME) and nonlinear mixed-effects (NLME) methods to model longitudinally collected perimetry data and determines whether NLME methods provide significant improvements over LME methods and OLSLR.
Models of LME and NLME (exponential, whereby the rate of change in sensitivity worsens over time) were examined with two levels of nesting (subject and eye within subject) to predict the mean deviation. Models were compared using analysis of variance or Akaike's information criterion and Bayesian information criterion, as appropriate.
Nonlinear (exponential) models provided significantly better fits than linear models (P < 0.0001). Nonlinear fits markedly improved the validity of the model, as evidenced by the lack of significant autocorrelation, residuals that are closer to being normally distributed, and improved homogeneity. From the fitted exponential model, the rate of glaucomatous progression for an average subject of age 70 years was −0.07 decibels (dB) per year. Ten years later, the same eye would be deteriorating at −0.12 dB/y.
Multilevel mixed-effects models provide better fits to the test data than OLSLR by accounting for group effects and/or within-group correlation. However, the fitted LME model poorly tracks visual field (VF) change over time. An exponential model provides a significant improvement over linear models and more accurately tracks VF change over time in this cohort.
OLS methods are inappropriate for performing trend analyses on repeatedly measured longitudinal data. Instead, a nonlinear (exponential) model provides a significant improvement over linear models and more accurately tracks visual field change over time in this cohort.
PMCID: PMC3747790  PMID: 23833069
glaucoma; mean deviation; linear mixed effect; nonlinear mixed effect; autocorrelation
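The two progression rates quoted in entry 22 (−0.07 dB/y at age 70 and −0.12 dB/y ten years later) can be turned into a rough growth constant with a little arithmetic. The sketch below assumes, purely for illustration, that the rate of change itself grows as a simple exponential in time, rate(t) = rate(0)·exp(k·t). The abstract does not give the fitted model's parameterization, so this functional form and the helper names are assumptions, not the authors' model.

```python
import math


def implied_growth_constant(rate_start, rate_end, years):
    """Solve rate_end = rate_start * exp(k * years) for k (per year)."""
    return math.log(rate_end / rate_start) / years


def rate_after(rate_start, k, years):
    """Rate of change after `years`, under the assumed exponential form."""
    return rate_start * math.exp(k * years)


# Rates reported in the abstract, in dB per year.
k = implied_growth_constant(-0.07, -0.12, 10)
doubling_time = math.log(2) / k  # years for the rate of loss to double
```

Under this assumption k is about 0.054 per year, which corresponds to the rate of loss doubling roughly every 13 years.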
23.  An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data 
Statistica Sinica  2009;19(4):1641-1663.
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the “p > n” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
PMCID: PMC2909687  PMID: 20664718
Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation
24.  Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches 
Biostatistics (Oxford, England)  2012;14(2):259-272.
With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample.
PMCID: PMC3590922  PMID: 23087411
Cross-validation; Generalized ridge; Mean squared prediction error; Measurement error
25.  Metabolic Signatures of Adiposity in Young Adults: Mendelian Randomization Analysis and Effects of Weight Change 
PLoS Medicine  2014;11(12):e1001765.
In this study, Wurtz and colleagues investigated to what extent elevated body mass index (BMI) within the normal weight range has causal influences on the detailed systemic metabolite profile in early adulthood using Mendelian randomization analysis.
Please see later in the article for the Editors' Summary
Increased adiposity is linked with higher risk for cardiometabolic diseases. We aimed to determine to what extent elevated body mass index (BMI) within the normal weight range has causal effects on the detailed systemic metabolite profile in early adulthood.
Methods and Findings
We used Mendelian randomization to estimate causal effects of BMI on 82 metabolic measures in 12,664 adolescents and young adults from four population-based cohorts in Finland (mean age 26 y, range 16–39 y; 51% women; mean ± standard deviation BMI 24±4 kg/m²). Circulating metabolites were quantified by high-throughput nuclear magnetic resonance metabolomics and biochemical assays. In cross-sectional analyses, elevated BMI was adversely associated with cardiometabolic risk markers throughout the systemic metabolite profile, including lipoprotein subclasses, fatty acid composition, amino acids, inflammatory markers, and various hormones (p<0.0005 for 68 measures). Metabolite associations with BMI were generally stronger for men than for women (median 136%, interquartile range 125%–183%). A gene score for predisposition to elevated BMI, composed of 32 established genetic correlates, was used as the instrument to assess causality. Causal effects of elevated BMI closely matched observational estimates (correspondence 87%±3%; R² = 0.89), suggesting causative influences of adiposity on the levels of numerous metabolites (p<0.0005 for 24 measures), including lipoprotein lipid subclasses and particle size, branched-chain and aromatic amino acids, and inflammation-related glycoprotein acetyls. Causal analyses of certain metabolites and potential sex differences warrant stronger statistical power. Metabolite changes associated with change in BMI during 6 y of follow-up were examined for 1,488 individuals. Change in BMI was accompanied by widespread metabolite changes, which had an association pattern similar to that of the cross-sectional observations, yet with greater metabolic effects (correspondence 160%±2%; R² = 0.92).
Mendelian randomization indicates causal adverse effects of increased adiposity with multiple cardiometabolic risk markers across the metabolite profile in adolescents and young adults within the non-obese weight range. Consistent with the causal influences of adiposity, weight changes were paralleled by extensive metabolic changes, suggesting a broadly modifiable systemic metabolite profile in early adulthood.
Editors' Summary
Adiposity (having excessive body fat) is a growing global threat to public health. Body mass index (BMI, calculated by dividing a person's weight in kilograms by their height in meters squared) is a coarse indicator of excess body weight, but the measure is useful in large population studies. Compared to people with a lean body weight (a BMI of 18.5–24.9 kg/m²), individuals with higher BMI have an elevated risk of developing life-shortening cardiometabolic diseases: cardiovascular diseases that affect the heart and/or the blood vessels (for example, heart failure and stroke) and metabolic diseases that affect the cellular chemical reactions that sustain life (for example, diabetes). People become unhealthily fat by consuming food and drink that contains more energy (calories) than they need for their daily activities. So adiposity can be prevented and reversed by eating less and exercising more.
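The BMI definition given in this summary is simple enough to write down directly. The sketch below computes it and checks a value against the lean range quoted in the text (18.5 to 24.9); the function names and the example weight and height are hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_lean_range(bmi, low=18.5, high=24.9):
    """True if bmi falls in the lean range quoted in the summary."""
    return low <= bmi <= high


# Hypothetical person: 70 kg, 1.75 m tall.
example_bmi = body_mass_index(70.0, 1.75)
example_is_lean = in_lean_range(example_bmi)
```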
Why Was This Study Done?
Epidemiological studies, which record the patterns of risk factors and disease in populations, suggest that the illness and death associated with excess body weight is partly attributable to abnormalities in how individuals with high adiposity metabolize carbohydrates and fats, leading to higher blood sugar and cholesterol levels. Further, adiposity is also associated with many deviations in the metabolic profile beyond these commonly measured risk factors. However, epidemiological studies cannot prove that adiposity causes specific changes in a person's systemic (overall) metabolic profile because individuals with high BMI may share other characteristics (confounding factors) that are the actual causes of both adiposity and metabolic abnormalities. Moreover, having a change in some aspect of metabolism could also lead to adiposity, rather than vice versa (reverse causation). Importantly, if there is a causal effect of adiposity on cardiometabolic risk factor levels, it might be possible to prevent the progression towards cardiometabolic diseases by weight loss. Here, the researchers use “Mendelian randomization” to examine whether increased BMI within the normal and overweight range causally influences metabolic risk factors from many biological pathways during early adulthood. Because gene variants are inherited randomly, they are not prone to confounding and are free from reverse causation. Several gene variants are known to lead to modestly increased BMI. Thus, an investigation of the associations between these gene variants and risk factors across the systemic metabolite profile in a population of healthy individuals can indicate whether higher BMI is causally related to known and novel metabolic risk factors and higher cardiometabolic disease risk.
What Did the Researchers Do and Find?
The researchers measured the BMI of 12,664 adolescents and young adults (average BMI 24.7 kg/m²) living in Finland and the blood levels of 82 metabolites in these young individuals at a single time point. Statistical analysis of these data indicated that elevated BMI was adversely associated with numerous cardiometabolic risk factors. For example, elevated BMI was associated with raised levels of low-density lipoprotein, “bad” cholesterol that increases cardiovascular disease risk. Next, the researchers used a gene score for predisposition to increased BMI, composed of 32 gene variants correlated with increased BMI, as an “instrumental variable” to assess whether adiposity causes metabolite abnormalities. The effects on the systemic metabolite profile of a 1-kg/m² increment in BMI due to genetic predisposition closely matched the effects of an observed 1-kg/m² increment in adulthood BMI on the metabolic profile. That is, higher levels of adiposity had causal effects on the levels of numerous blood-based metabolic risk factors, including higher levels of low-density lipoprotein cholesterol and triglyceride-carrying lipoproteins, protein markers of chronic inflammation and adverse liver function, impaired insulin sensitivity, and elevated concentrations of several amino acids that have recently been linked with the risk for developing diabetes. Elevated BMI also causally led to lower levels of certain high-density lipoprotein lipids in the blood, a marker for the risk of future cardiovascular disease. Finally, an examination of the metabolic changes associated with changes in BMI in 1,488 young adults after a period of six years showed that those metabolic measures that were most strongly associated with BMI at a single time point likewise displayed the highest responsiveness to weight change over time.
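To make the "instrumental variable" idea above concrete, here is a minimal sketch of a ratio (Wald) estimator on simulated data. It is a generic textbook technique, not necessarily the estimator used in the study. The helper names, the simulated effect sizes and the unmeasured confounder are all assumptions chosen for illustration.

```python
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def wald_ratio(instrument, exposure, outcome):
    """Causal effect estimate: (instrument->outcome) / (instrument->exposure)."""
    return slope(instrument, outcome) / slope(instrument, exposure)


# Simulated cohort (all numbers hypothetical). The confounder raises both
# BMI and the metabolite, so the plain regression slope is biased upward.
rng = random.Random(0)
n = 50_000
true_effect = 0.3
gene_score = [rng.gauss(0, 1) for _ in range(n)]
confounder = [rng.gauss(0, 1) for _ in range(n)]
bmi = [24 + 0.5 * g + u + rng.gauss(0, 1)
       for g, u in zip(gene_score, confounder)]
metabolite = [true_effect * b + u + rng.gauss(0, 1)
              for b, u in zip(bmi, confounder)]

observational_estimate = slope(bmi, metabolite)
causal_estimate = wald_ratio(gene_score, bmi, metabolite)
```

Because the simulated gene score is independent of the confounder, the ratio estimate should land near the true effect while the observational slope overshoots it. In the study itself the two kinds of estimate closely matched, which is what supports a causal reading there.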
What Do These Findings Mean?
These findings suggest that increased adiposity has causal adverse effects on multiple cardiometabolic risk markers in non-obese young adults beyond the effects on cholesterol and blood sugar. Like all Mendelian randomization studies, the reliability of the causal association reported here depends on several assumptions made by the researchers. Importantly, the results of both the causal effect analyses and the longitudinal study suggest that there is no threshold below which a BMI increase does not adversely affect the metabolic profile, and that a systemic metabolic profile linked with high cardiometabolic disease risk that becomes established during early adulthood can be reversed. Overall, these findings therefore highlight the importance of weight reduction as a key target for metabolic risk factor control among young adults.
Additional Information
Please access these websites via the online version of this summary at
The Computational Medicine Research Team of the University of Oulu has a webpage that provides further information on metabolite profiling by high-throughput NMR metabolomics
The World Health Organization provides information on obesity (in several languages)
The Global Burden of Disease Study website provides the latest details about global obesity trends
The UK National Health Service Choices website provides information about obesity, cardiovascular disease, and type 2 diabetes (including some personal stories)
The American Heart Association provides information on all aspects of cardiovascular disease and diabetes and on keeping healthy; its website includes personal stories about heart attacks, stroke, and diabetes
The US Centers for Disease Control and Prevention has information on all aspects of overweight and obesity and information about heart disease, stroke, and diabetes
MedlinePlus provides links to other sources of information on heart disease, vascular disease, and obesity (in English and Spanish)
Wikipedia has a page on Mendelian randomization (note: Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
PMCID: PMC4260795  PMID: 25490400
