Related Articles

1.  Inequities in maternal and child health outcomes and interventions in Ghana 
BMC Public Health  2012;12:252.
Background
With the date for achieving the targets of the Millennium Development Goals (MDGs) approaching fast, there is a heightened concern about equity, as inequities hamper progress towards the MDGs. Equity-focused approaches have the potential to accelerate the progress towards achieving the health-related MDGs faster than the current pace in a more cost-effective and sustainable manner. Ghana's rate of progress towards MDGs 4 and 5 related to reducing child and maternal mortality respectively is less than what is required to achieve the targets. The objective of this paper is to examine the equity dimension of child and maternal health outcomes and interventions using Ghana as a case study.
Methods
Data from the 2008 Ghana Demographic and Health Survey report are analyzed for inequities in selected maternal and child health outcomes and interventions using population-weighted, regression-based measures: the slope index of inequality and the relative index of inequality.
Results
No statistically significant inequities are observed in infant and under-five mortality, perinatal mortality, wasting and acute respiratory infection in children. However, stunting, underweight in under-five children, anaemia in children and women, childhood diarrhoea and underweight in women (BMI < 18.5) show inequities that are to the disadvantage of the poorest. The rates significantly decrease among the wealthiest quintile as compared to the poorest. In contrast, overweight (BMI 25-29.9) and obesity (BMI ≥ 30) among women reveal a different trend: there are inequities in favour of the poorest. In other words, in Ghana overweight and obesity increase significantly among women in the wealthiest quintile compared to the poorest. With respect to interventions, treatment of diarrhoea in children, receipt of all basic vaccines among children and sleeping under an insecticide-treated net (ITN) among children and pregnant women have no wealth-related gradient. Skilled care at birth, deliveries in a health facility (both public and private), caesarean section, use of modern contraceptives and intermittent preventive treatment for malaria during pregnancy all indicate gradients that are in favour of the wealthiest. The poorest use less of these interventions. Not unexpectedly, there is more use of home delivery among women of the poorest quintile.
Conclusion
Significant inequities are observed in many of the selected child and maternal health outcomes and interventions. Failure to address these inequities vigorously is likely to lead to non-achievement of the MDG targets related to improving child and maternal health (MDGs 4 and 5). The government should therefore give due attention to tackling inequities in health outcomes and use of interventions by implementing equity-enhancing measures both within and outside the health sector, in line with the principles of Primary Health Care and the recommendations of the WHO Commission on Social Determinants of Health.
doi:10.1186/1471-2458-12-252
PMCID: PMC3338377  PMID: 22463465
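The population-weighted, regression-based measures named in the Methods of entry 1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes wealth groups ordered from poorest to wealthiest, uses the midpoint of each group's cumulative population share as its rank, and takes the relative index of inequality (RII) to be the slope divided by the weighted mean rate, which is only one of several definitions in use. The function name `slope_and_relative_index` and the example figures are hypothetical.

```python
def slope_and_relative_index(rates, pop_shares):
    """Return (SII, RII) for groups ordered from poorest to wealthiest.

    rates      : outcome rate in each wealth group
    pop_shares : population share (weight) of each group
    RII is computed here as SII / weighted mean rate (an assumption;
    other definitions exist).
    """
    total = sum(pop_shares)
    weights = [s / total for s in pop_shares]

    # Relative rank of each group: midpoint of its cumulative population range.
    ranks = []
    cumulative = 0.0
    for w in weights:
        ranks.append(cumulative + w / 2)
        cumulative += w

    # Weighted least-squares slope of rate on relative rank.
    mean_rank = sum(w * x for w, x in zip(weights, ranks))
    mean_rate = sum(w * y for w, y in zip(weights, rates))
    sxy = sum(w * (x - mean_rank) * (y - mean_rate)
              for w, x, y in zip(weights, ranks, rates))
    sxx = sum(w * (x - mean_rank) ** 2 for w, x in zip(weights, ranks))

    sii = sxy / sxx
    rii = sii / mean_rate
    return sii, rii


# Made-up stunting rates (%) for five equal-sized wealth quintiles.
sii, rii = slope_and_relative_index([30, 25, 20, 15, 10], [0.2] * 5)
print(sii, rii)
```

A negative slope here means the rate falls as wealth rises, that is, the outcome is concentrated among the poorest.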
2.  Bias in random forest variable importance measures: Illustrations, sources and a solution 
BMC Bioinformatics  2007;8:25.
Background
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
Results
Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.
Conclusion
We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
doi:10.1186/1471-2105-8-25
PMCID: PMC1796903  PMID: 17254353
3.  SNP interaction detection with Random Forests in high-dimensional genetic data 
BMC Bioinformatics  2012;13:164.
Background
Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.
Results
RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.
Conclusions
While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
doi:10.1186/1471-2105-13-164
PMCID: PMC3463421  PMID: 22793366
4.  Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest 
Nucleic Acids Research  2011;39(9):e62.
We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, support vector machine (SVM) and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top ranked SNPs by the three methods. Software and webserver are available at http://svmsnps.njit.edu.
doi:10.1093/nar/gkr064
PMCID: PMC3089490  PMID: 21317188
5.  Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene × gene and gene × environment interactions 
BMC Proceedings  2007;1(Suppl 1):S58.
Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have been recently proposed for use in high-dimensional data (Monte Carlo logic regression, random forests, and generalized boosted regression). An intuitive way to detect an association between genetic markers and disease status is to use variable importance measures, even though the stability of these measures in the context of a whole-genome association study is unknown. For the simulated data of Problem 3 in the Genetic Analysis Workshop 15 (GAW15), we examined the variability of both rankings and magnitude of variable importance measures using 10 variables simulated to participate in gene × gene and gene × environment interactions. We conducted 500 analyses per method on one randomly selected replicate, tallying the rankings and importance measures for each of the 10 variables of interest. When the simulated effect size was strong, all three methods showed stable rankings and estimates of variable importance. However, under conditions more commonly expected to be encountered in complex diseases, random forests and generalized boosted regression showed more stable estimates of variable importance and variable rankings. Individuals endeavoring to apply statistical learning methods to detect interaction in complex disease studies should perform repeated analyses in order to assure variable importance measures and rankings do not vary greatly, even for statistical learning algorithms that are thought to be stable.
PMCID: PMC2367584  PMID: 18466558
6.  Evaluating the Relative Environmental Impact of Countries 
PLoS ONE  2010;5(5):e10440.
Environmental protection is critical to maintain ecosystem services essential for human well-being. It is important to be able to rank countries by their environmental impact so that poor performers as well as policy ‘models’ can be identified. We provide novel metrics of country-specific environmental impact ranks – one proportional to total resource availability per country and an absolute (total) measure of impact – that explicitly avoid incorporating confounding human health or economic indicators. Our rankings are based on natural forest loss, habitat conversion, marine captures, fertilizer use, water pollution, carbon emissions and species threat, although many other variables were excluded due to a lack of country-specific data. Of 228 countries considered, 179 (proportional) and 171 (absolute) had sufficient data for correlations. The proportional index ranked Singapore, Korea, Qatar, Kuwait, Japan, Thailand, Bahrain, Malaysia, Philippines and Netherlands as having the highest proportional environmental impact, whereas Brazil, USA, China, Indonesia, Japan, Mexico, India, Russia, Australia and Peru had the highest absolute impact (i.e., total resource use, emissions and species threatened). Proportional and absolute environmental impact ranks were correlated, with mainly Asian countries having both high proportional and absolute impact. Despite weak concordance among the drivers of environmental impact, countries often perform poorly for different reasons. We found no evidence to support the environmental Kuznets curve hypothesis of a non-linear relationship between impact and per capita wealth, although there was a weak reduction in environmental impact as per capita wealth increases. Using structural equation models to account for cross-correlation, we found that increasing wealth was the most important driver of environmental impact. 
Our results show that the global community not only has to encourage better environmental performance in less-developed countries, especially those in Asia, but also needs to focus on the development of environmentally friendly practices in wealthier countries.
doi:10.1371/journal.pone.0010440
PMCID: PMC2862718  PMID: 20454670
7.  Screening large-scale association study data: exploiting interactions using random forests 
BMC Genetics  2004;5:32.
Background
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
Results
Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
Conclusions
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
doi:10.1186/1471-2156-5-32
PMCID: PMC545646  PMID: 15588316
8.  Robust Statistical Methods for Analysis of Biomarkers Measured with Batch/Experiment Specific Errors 
Statistics in medicine  2010;29(3):361-370.
SUMMARY
In many biological studies, biomarkers are measured with errors. In addition, study samples are often divided and measured in separate batches, and data collected from different experiments are used in a single analysis. Generally speaking, the structure of the measurement error is unknown and is not easy to ascertain. While the conditions under which the measurements are taken vary from one batch/experiment to another, they are often held steady within each batch/experiment. Thus, the measurement error can be considered batch/experiment specific, that is, fixed within each batch/experiment, which results in a rank-preserving property within each batch/experiment. Under this condition, we study robust statistical methods for analyzing the association between an outcome variable and predictors measured with error, and for evaluating the diagnostic or predictive accuracy of these biomarkers. Our methods require no assumptions on the structure and distribution of the measurement error, which are often unrealistic. Compared to existing methods that are predicated on normality and additive structure of measurement errors, our methods still yield valid inferences under departure from these assumptions. The proposed methods are easy to implement using off-the-shelf software. Simulation studies show that under various measurement error structures, the performance of the proposed methods is satisfactory even for a fairly small sample size, whereas existing methods under misspecified structures and a naive approach exhibited substantial bias. Our methods are illustrated using a biomarker validation case-control study for colorectal neoplasms.
doi:10.1002/sim.3796
PMCID: PMC3177604  PMID: 20020422
batch effect; batch/experiment specific error; measurement error; ROC analysis; surrogate variable
9.  Refining developmental coordination disorder subtyping with multivariate statistical methods 
Background
With a large number of potentially relevant clinical indicators, penalization and ensemble learning methods are thought to provide better predictive performance than the usual linear predictors. However, little is known about how they perform in clinical studies where few cases are available. We used Random Forests and Partial Least Squares Discriminant Analysis to select the most salient impairments in Developmental Coordination Disorder (DCD) and assess patient similarity.
Methods
We considered a wide-range testing battery for various neuropsychological and visuo-motor impairments which aimed at characterizing subtypes of DCD in a sample of 63 children. Classifiers were optimized on a training sample, and they were used subsequently to rank the 49 items according to a permuted measure of variable importance. In addition, subtyping consistency was assessed with cluster analysis on the training sample. Clustering fitness and predictive accuracy were evaluated on the validation sample.
Results
Both classifiers yielded a relevant subset of item impairments that together accounted for a sharp discrimination between three DCD subtypes: ideomotor, visual-spatial and constructional, and mixed dyspraxia. The main impairments found to characterize the three subtypes were digital perception, imitations of gestures, digital praxia, Lego blocks, visual-spatial structuration, visual-motor integration, and coordination between upper and lower limbs. Classification accuracy was above 90% for all classifiers, and clustering fitness was found to be satisfactory.
Conclusions
Random Forests and Partial Least Squares Discriminant Analysis are useful tools to extract salient features from a large pool of correlated binary predictors, and they also provide a way to assess individuals' proximities in a reduced factor space. Fewer than 15 neuro-visual, neuro-psychomotor and neuro-psychological tests might be required to provide a sensitive and specific diagnosis of DCD on this particular sample, and isolated markers might be used to refine our understanding of DCD in future studies.
doi:10.1186/1471-2288-12-107
PMCID: PMC3464628  PMID: 22834855
10.  EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis 
Bioinformatics  2008;24(14):1603-1610.
Motivation: We developed an EM-random forest (EMRF) for Haseman–Elston quantitative trait linkage analysis that accounts for marker ambiguity and weighs each sib-pair according to the posterior identical by descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index used to rank markers for variable selection is not optimal when applied to linkage data because of correlation between markers. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
Results: Using simulations, we find that the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman–Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.
Availability: The source code for EMRF written in C is available at www.infornomics.utoronto.ca/downloads/EMRF
Contact: bull@mshri.on.ca
Supplementary information: Supplementary data are available at www.infornomics.utoronto.ca/downloads/EMRF
doi:10.1093/bioinformatics/btn239
PMCID: PMC2638262  PMID: 18499695
11.  Measuring inequality: tools and an illustration 
Background
This paper examines an aspect of the problem of measuring inequality in health services. The measures that are commonly applied can be misleading because such measures obscure the difficulty in obtaining a complete ranking of distributions. The nature of the social welfare function underlying these measures is important. The overall object is to demonstrate that varying implications for the welfare of society result from inequality measures.
Method
Various tools for measuring a distribution are applied to some illustrative data on four distributions about mental health services. Although these data refer to this one aspect of health, the exercise is of broader relevance than mental health. The summary measures of dispersion conventionally used in empirical work are applied to the data here, such as the standard deviation, the coefficient of variation, the relative mean deviation and the Gini coefficient. Other, less commonly used measures also are applied, such as Theil's Index of Entropy, Atkinson's Measure (using two differing assumptions about the inequality aversion parameter). Lorenz curves are also drawn for these distributions.
Results
Distributions are shown to have differing rankings (in terms of which is more equal than another), depending on which measure is applied.
Conclusion
The scope and content of the literature from the past decade about health inequalities and inequities suggest that the economic literature from the past 100 years about inequality and inequity may have been overlooked, generally speaking, in the health inequalities and inequity literature. An understanding of economic theory and economic method, partly introduced in this article, is helpful in analysing health inequality and inequity.
doi:10.1186/1475-9276-5-5
PMCID: PMC1550241  PMID: 16716217
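Two of the conventional summary measures listed in the Method of entry 11, the Gini coefficient (via the Lorenz curve) and the coefficient of variation, can be sketched as below. This is a minimal illustration under stated assumptions: values are non-negative with a positive total, the Gini coefficient is computed as one minus twice the trapezoidal area under the empirical Lorenz curve, and the population standard deviation is used. The function names are hypothetical and the data are made up.

```python
import math


def lorenz_points(values):
    """Empirical Lorenz curve as (population share, cumulative value share)."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    cumulative = 0.0
    for i, v in enumerate(ordered, start=1):
        cumulative += v
        points.append((i / n, cumulative / total))
    return points


def gini(values):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    pts = lorenz_points(values)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


# Made-up service counts for two illustrative distributions.
for name, dist in [("A", [2, 4, 6, 8]), ("B", [1, 1, 1, 17])]:
    print(name, gini(dist), coefficient_of_variation(dist))
```

Computing several measures side by side in this way is how one would check whether they order a given set of distributions consistently.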
12.  Estimating the empirical Lorenz curve and Gini coefficient in the presence of error with nested data 
Statistics in medicine  2008;27(16):3191-3208.
SUMMARY
The Lorenz curve is a graphical tool that is widely used to characterize the concentration of a measure in a population, such as wealth. It is frequently the case that the measure of interest used to rank experimental units when estimating the empirical Lorenz curve, and the corresponding Gini coefficient, is subject to random error. This error can result in an incorrect ranking of experimental units which inevitably leads to a curve that exaggerates the degree of concentration (variation) in the population. We consider a specific data configuration with a hierarchical structure where multiple observations are aggregated within experimental units to form the outcome whose distribution is of interest. Within this context, we explore this bias and discuss several widely available statistical methods that have the potential to reduce or remove the bias in the empirical Lorenz curve. The properties of these methods are examined and compared in a simulation study. This work is motivated by a health outcomes application that seeks to assess the concentration of black patient visits among primary care physicians. The methods are illustrated on data from this study.
doi:10.1002/sim.3151
PMCID: PMC3465674  PMID: 18172873
concentration; distribution; inequality; hierarchical data
13.  Shed some light on darkness: will Tanzania reach the millennium development goals? 
The overall picture of health in sub-Saharan Africa can easily be painted in dark colours. The aim of this viewpoint is to discuss epidemiological data from Tanzania on overall health indicators and the burden of malaria and HIV. Is the situation in Tanzania improving or deteriorating? Are the health-related millennium development goals (MDG) on reducing under-five mortality, reducing maternal mortality and halting HIV and malaria within reach?
Conclusion: Child mortality and infant mortality rates are decreasing quite dramatically. Malaria prevention strategies and new effective treatment are being launched. The MDG 4 on child mortality is clearly within reach, and the same optimism may apply to MDG 6 on combating malaria.
doi:10.1111/j.1651-2227.2007.00293.x
PMCID: PMC1974835  PMID: 17465983
Childhood mortality; HIV; Infant mortality; Maternal mortality; Malaria; Millennium development goal; Tanzania
14.  Malaria control in South Africa 2000–2010: beyond MDG6 
Malaria Journal  2012;11:294.
Background
Malaria is one of the key targets within Goal 6 of the Millennium Development Goals (MDGs), whereby the disease needs to be halted and reversed by the year 2015. Several other international targets have been set; however, the MDGs are universally accepted and are therefore the focus of this manuscript.
Methods
An assessment was undertaken to determine the progress South Africa has made against the malaria target of MDG Goal 6. Data were analyzed for the period 2000 until 2010 and verified after municipal boundary changes in some of South Africa’s districts and subsequent to verifying actual residence of malaria positive cases.
Results
South Africa has made significant progress in controlling malaria transmission over the past decade: between 2000 and 2010, malaria cases declined by 89.41% (63663 vs 6741) and deaths decreased by 85.4% (453 vs 66). Over the same period, malaria cases among children under five years of age declined by 93% (6791 in 2000 vs 451 in 2010). As a result, South Africa has achieved and exceeded the malaria target of the MDGs. A series of interventions contributed to this decrease, including a drug policy change from monotherapy to artemisinin combination therapy; an insecticide change from pyrethroids back to DDT; cross-border collaboration (South Africa with Mozambique and Swaziland through the Lubombo Spatial Development Initiative, LSDI); and financial investment in malaria control. KwaZulu-Natal Province has seen the largest reduction in malaria cases and deaths between 2000 and 2010 (99.1% for cases, 41786 vs 380; 98.5% for deaths, 340 vs 5). Limpopo Province recorded the smallest reduction in malaria cases among the malaria-endemic provinces (56.1%, 9487 vs 4174, comparing 2000 to 2010).
Conclusions
South Africa is well positioned to move beyond the malaria target of the MDGs and progress towards elimination. However, in addition to its existing interventions, the country will need to sustain its financing for malaria control, support programme reorientation towards elimination, and scale up active surveillance coupled with treatment at the community level. Moreover, cross-border malaria collaboration needs to be sustained and scaled up to prevent the re-introduction of malaria into the country.
doi:10.1186/1475-2875-11-294
PMCID: PMC3502494  PMID: 22913727
Malaria elimination; South Africa; Vector control; Case management and Millennium Development Goals
15.  Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons 
Background
One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and which are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a certain property with good performance, in a computationally efficient and more robust way, since the presence of irrelevant or redundant features can cause poor generalization capacity. In this paper, an alternative selection method based on Random Forests to determine variable importance is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a ranked list based on the variable importance.
Results
The model generalizes well even with a high dimensional dataset and in the presence of highly correlated variables. The feature selection step was shown to yield lower prediction errors with RMSE values 23% lower than without feature selection, albeit using only 6% of the total number of variables (89 from the original 1485). The proposed approach further compared favourably with other feature selection methods and dimension reduction of the feature space. The predictive model was selected using a 10-fold cross validation procedure and, after selection, it was validated with an independent set to assess its performance when applied to new data and the results were similar to the ones obtained for the training set, supporting the robustness of the proposed approach.
Conclusions
The proposed methodology seemingly improves the prediction performance of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors, providing faster and more cost-effective calculation of descriptors by reducing their numbers, and providing a better understanding of the underlying relationship between the molecular structure represented by descriptors and the property of interest.
doi:10.1186/1758-2946-5-9
PMCID: PMC3599435  PMID: 23399299
Feature selection; Variable importance; High dimensional data; Random forests; Data-mining; Property prediction; QSPR; Hybrid methodology
16.  Using Incomplete Citation Data for MEDLINE Results Ranking 
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the result ranking. Specifically, we test the ability of citation count and PageRank to identify “important articles” as defined by experts from large result sets with decreasing citation information. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
PMCID: PMC1560575  PMID: 16779053
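The two ranking signals compared in entry 16, simple citation counts and PageRank, can be sketched on a toy citation graph. This is a minimal illustration, not the authors' system: the damping factor of 0.85, the fixed number of power iterations, and the uniform redistribution of rank from articles that cite nothing are all assumptions, and the article identifiers are made up.

```python
def citation_counts(citations):
    """Number of incoming citations per article."""
    counts = {}
    for source, cited in citations.items():
        counts.setdefault(source, 0)
        for target in cited:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(citations, damping=0.85, iterations=100):
    """PageRank by power iteration.

    citations maps each article to the list of articles it cites.
    Articles with no outgoing citations spread their rank uniformly.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not citations.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            targets = citations.get(node, [])
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Toy graph: A and B both cite C; C cites nothing.
graph = {"A": ["C"], "B": ["C"], "C": []}
print(citation_counts(graph))
print(pagerank(graph))
```

Dropping edges from `graph` before calling either function is one simple way to mimic the incomplete citation data the abstract discusses.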
17.  Gene selection and classification of microarray data using random forest 
BMC Bioinformatics  2006;7:3.
Background
Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
Results
We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Conclusion
Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
doi:10.1186/1471-2105-7-3
PMCID: PMC1363357  PMID: 16398926
18.  The behaviour of random forest permutation-based variable importance measures under predictor correlation 
BMC Bioinformatics  2010;11:110.
Background
Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.
Results
When predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under HA and H0.
Conclusions
Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the increased VIMs observed for correlated predictors should be considered a "bias" (because they do not directly reflect the coefficients in the generating model) or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
doi:10.1186/1471-2105-11-110
PMCID: PMC2848005  PMID: 20187966
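The unconditional, unscaled permutation importance discussed in the abstract above can be sketched as follows. This is a simplified illustration, not the study's simulation: the data-generating process, the sample size, the seeds and the fixed `toy_model` (a stand-in for a fitted forest) are all assumptions, and the importance is measured as the rise in mean squared error rather than in out-of-bag error.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared prediction error of a fitted model on the given rows."""
    return sum((model(row) - y) ** 2 for row, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, repeats=20, seed=0):
    """Unconditional, unscaled importance: mean rise in error after shuffling one column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    total_increase = 0.0
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)  # breaks the link to the outcome and to other predictors
        permuted_rows = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        total_increase += mean_squared_error(model, permuted_rows, targets) - baseline
    return total_increase / repeats


# Hypothetical data: x0 drives the outcome, x1 is correlated with x0, x2 is pure noise.
data_rng = random.Random(42)
rows, targets = [], []
for _ in range(200):
    x0 = data_rng.gauss(0.0, 1.0)
    x1 = x0 + data_rng.gauss(0.0, 0.3)
    x2 = data_rng.gauss(0.0, 1.0)
    rows.append((x0, x1, x2))
    targets.append(2.0 * x0 + data_rng.gauss(0.0, 0.5))


def toy_model(row):
    """Stand-in for a fitted learner (assumption); it uses only the first predictor."""
    return 2.0 * row[0]


importances = [permutation_importance(toy_model, rows, targets, c) for c in range(3)]
print(importances)
```

Because `toy_model` ignores x1 and x2, their importance is exactly zero here. A fitted forest may also split on the correlated predictor x1, in which case x1 would receive non-zero importance; that behaviour is the subject of the abstract and is not reproduced by this sketch.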
19.  Workforce analysis using data mining and linear regression to understand HIV/AIDS prevalence patterns 
Background
The achievement of the Millennium Development Goals (MDGs) depends on sufficient supply of health workforce in each country. Although country-level data support this contention, it has been difficult to evaluate health workforce supply and MDG outcomes at the country level. The purpose of the study was to examine the association between the health workforce, particularly the nursing workforce, and the achievement of the MDGs, taking into account other factors known to influence health status, such as socioeconomic indicators.
Methods
A merged data set that includes country-level MDG outcomes, workforce statistics, and general socioeconomic indicators was utilized for the present study. Data were obtained from the Global Human Resources for Health Atlas 2004, the WHO Statistical Information System (WHOSIS) 2000, UN Fund for Development and Population Assistance (UNFDPA) 2000, the International Council of Nurses "Nursing in the World", and the WHO/UNAIDS database.
Results
The main factors in understanding HIV/AIDS prevalence rates are physician density followed by female literacy rates and nursing density in the country. Using general linear model approaches, increased physician and nurse density (number of physicians or nurses per population) was associated with lower adult HIV/AIDS prevalence rate, even when controlling for socioeconomic indicators.
Conclusion
Increased nurse and physician density are associated with improved health outcomes, suggesting that countries aiming to attain the MDGs related to HIV/AIDS would do well to invest in their health workforce. Implications for international and country level policy are discussed.
doi:10.1186/1478-4491-6-2
PMCID: PMC2270867  PMID: 18237419
20.  Inequities in utilization of maternal health interventions in Namibia: implications for progress towards MDG 5 targets 
Background
Inequities in the utilization of maternal health services impede progress towards the MDG 5 target of reducing the maternal mortality ratio by three quarters, between 1990 and 2015. In Namibia, despite increasing investments in the health sector, the maternal mortality ratio has increased from 271 per 100,000 live births in the period 1991-2000 to 449 per 100,000 live births in 1998-2007. Monitoring equity in the use of maternal health services is important to target scarce resources to those with more need and expedite the progress towards the MDG 5 target. The objective of this study is to measure socio-economic inequalities in access to maternal health services and propose recommendations relevant for policy and planning.
Methods
Data from the Namibia Demographic and Health Survey 2006-07 are analyzed for inequities in the utilization of maternal health. In measuring the inequities, rate-ratios, concentration curves and concentration indices are used.
Results
Regions with a relatively high human development index have the highest rates of delivery by skilled health service providers. The rate of caesarean section in women with post-secondary education is about seven times that of women with no education. Delivery by skilled providers is 30% more common among urban women than among their rural counterparts. The rich use public health facilities for child delivery 30% more than the poor.
Conclusion
Most of the indicators such as delivery by trained health providers, delivery by caesarean section and postnatal care show inequities favoring the most educated, urban areas, regions with high human development indices and the wealthy. In the presence of inequities, it is difficult to achieve a significant reduction in the maternal mortality ratio needed to realize the MDG 5 targets so long as a large segment of society has inadequate access to essential maternal health services and other basic social services. Addressing inequities in access to maternal health services should not only be seen as a health systems issue. The social determinants of health have to be tackled through multi-sectoral approaches in line with the principles of Primary Health Care and the recommendations of the Commission on Social Determinants of Health.
doi:10.1186/1475-9276-9-16
PMCID: PMC2898738  PMID: 20540793
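The rate ratios and concentration indices named in the Methods of the abstract above can be sketched in a few lines. This is a minimal, unweighted version for individual-level data; it ignores survey weights and ties in the wealth ranking, and the function names and example values are hypothetical rather than taken from the Namibia survey.

```python
def rate_ratio(rate_advantaged, rate_disadvantaged):
    """Ratio of the service-use rate in the better-off group to that in the worse-off group."""
    return rate_advantaged / rate_disadvantaged


def concentration_index(wealth, outcome):
    """C = 2 * cov(outcome, fractional rank) / mean(outcome), ranking poorest to richest.

    Positive values mean the outcome is concentrated among the better-off,
    negative values among the poor, and zero means no wealth-related gradient.
    """
    ordered = [h for _, h in sorted(zip(wealth, outcome), key=lambda pair: pair[0])]
    n = len(ordered)
    mean_outcome = sum(ordered) / n
    # Fractional rank of the i-th poorest person is (i + 0.5) / n, so its mean is 0.5.
    weighted = sum(h * (i + 0.5) / n for i, h in enumerate(ordered))
    return 2.0 * weighted / (n * mean_outcome) - 1.0


# Hypothetical example: only the richest of four women had a skilled attendant at birth.
print(concentration_index([1, 2, 3, 4], [0, 0, 0, 1]))
# Hypothetical example: the same service used equally at every wealth level.
print(concentration_index([1, 2, 3, 4], [1, 1, 1, 1]))
```

The index is twice the area between the concentration curve and the line of equality, so the two tools mentioned in the abstract summarise the same ranking information.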
21.  Maternal effects on the development of social rank and immunity trade-offs in male laboratory mice (Mus musculus). 
Social status in randomly constituted groups of male CFLP mice was predictable from early suckling behaviour and rate of weight gain in natal litters. High-ranking males were those that had suckled on more anterior teats and gained weight more quickly. Rank was not predicted by any measures of sibling interaction or hormone (testosterone, corticosterone) concentration. Aggressiveness in eventual high-rankers was associated negatively with the proportion of males in the litter at birth and the amount of maternal attention received. Aggressive social relationships within natal litters did not predict polarized rank relationships in randomized groups. Nevertheless, while still in their natal litters, and in the absence of aggressive rank relationships, eventual rank categories showed the same difference in modulation of testosterone concentration in relation to current immunocompetence (low-rankers modulating, high-rankers not), as has repeatedly been found in randomized groups by earlier studies. The role of maternal condition in determining rank-related life-history development in male mice is discussed.
PMCID: PMC1689489  PMID: 9842735
22.  Strength of Tobacco Control in Rural Communities 
Purpose
This study aimed to: (a) describe the Strength of Tobacco Control (SoTC) capacity, efforts and resources in rural communities, and (b) examine the relationships between SoTC scores and sociodemographic, political and health-ranking variables.
Methods
Data were collected during the baseline pre-intervention phase of a community-based randomized, controlled trial. Rural counties were selected using stratified random sampling (n = 39). Key informant interviews were employed. The SoTC, originally developed and tested with states, was adapted to a county-level measure assessing capacity, efforts, and resources. Univariate analysis and bivariate correlations assessed the SoTC total score and construct scores, as well as their relationships. Multiple regression examined the relationships of county-level sociodemographic, political and health-ranking variables with SoTC total and construct scores.
Findings
County population size was positively correlated with capacity (r = 0.44; P < .01), efforts (r = 0.54; P = .01) and SoTC total score (r = 0.51; P < .01). Communities with more resources for tobacco control had better overall county health rankings (r = .43; P < .01). With population size, percent Caucasian, tobacco production, and smoking prevalence as potential predictors of SoTC total score, only population size was significant.
Conclusions
SoTC scores may be useful in determining local tobacco control efforts and appropriate planning for additional public health interventions and resources. Larger rural communities were more likely to have strong tobacco control programs than smaller communities. Smaller rural communities may need to be targeted for training and technical assistance. Leadership development and allocation of resources are needed in all rural communities to address disparities in tobacco use and tobacco control policies.
doi:10.1111/j.1748-0361.2010.00273.x
PMCID: PMC2948793  PMID: 20446998
tobacco control; strength of tobacco control; environmental tobacco smoke pollution; rural communities
23.  Conditional variable importance for random forests 
BMC Bioinformatics  2008;9:307.
Background
Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.
Results
We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure.
Conclusion
The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.
doi:10.1186/1471-2105-9-307
PMCID: PMC2491635  PMID: 18620558
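The idea of a conditional permutation scheme in the abstract above can be illustrated with a small sketch: instead of shuffling a predictor across all observations, it is shuffled only among observations with similar values of a correlated predictor. The use of equal-size bins of a single conditioning variable is an assumption made here for simplicity; the abstract does not say how the conditioning groups are formed, and `conditional_permute` is a hypothetical helper, not the authors' implementation.

```python
import random


def conditional_permute(values, conditioning, bins, rng):
    """Shuffle `values` only within equal-size bins of the `conditioning` variable."""
    n = len(values)
    # Indices of the observations ordered by the conditioning variable.
    order = sorted(range(n), key=lambda i: conditioning[i])
    group_size = -(-n // bins)  # ceiling division
    permuted = list(values)
    for start in range(0, n, group_size):
        group = order[start:start + group_size]
        shuffled = [values[i] for i in group]
        rng.shuffle(shuffled)
        for index, value in zip(group, shuffled):
            permuted[index] = value
    return permuted


# Hypothetical example: `values` is almost perfectly correlated with `conditioning`.
conditioning = list(range(100))
values = [c + 0.5 for c in conditioning]
permuted = conditional_permute(values, conditioning, bins=10, rng=random.Random(1))
print(permuted[:10])
```

Each permuted value stays within the same tenth of the conditioning variable, so most of the correlation between the two predictors survives the shuffle, whereas an unconditional shuffle would destroy it.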
24.  Developing a summary hospital mortality index: retrospective analysis in English hospitals over five years 
Objectives To develop a transparent and reproducible measure for hospitals that can indicate when deaths in hospital or within 30 days of discharge are high relative to other hospitals, given the characteristics of the patients in that hospital, and to investigate those factors that have the greatest effect in changing the rank of a hospital, whether interactions exist between those factors, and the stability of the measure over time.
Design Retrospective cross sectional study of admissions to English hospitals.
Setting Hospital episode statistics for England from 1 April 2005 to 30 September 2010, with linked mortality data from the Office for National Statistics.
Participants 36.5 million completed hospital admissions in 146 general and 72 specialist trusts.
Main outcome measures Deaths within hospital or within 30 days of discharge from hospital.
Results The predictors that were used in the final model comprised admission diagnosis, age, sex, type of admission, and comorbidity. The percentage of people admitted who died in hospital or within 30 days of discharge was 4.2% for males and 4.5% for females. Emergency admissions comprised 75% of all admissions and 5.5% died, in contrast to 0.8% who died after an elective admission. The percentage who died with a Charlson comorbidity score of 0 was 2% in contrast with 15% who died with a score greater than 5. Given these variables, the relative standardised mortality rates of the hospitals were not noticeably changed by adjusting for the area-level deprivation and number of previous emergency visits to hospital. There was little evidence that including interaction terms changed the relative values by any great amount. Using these predictors the summary hospital mortality index (SHMI) was derived. For 2007/8 the model had a C statistic of 0.911 and accounted for 81% of the variability in between-hospital mortality. A random effects funnel plot was used to identify outlying hospitals. The outliers from the SHMI over the period 2005-10 have previously been identified using other mortality indicators.
Conclusion The SHMI is a relatively simple tool that can be used in conjunction with other information to identify hospitals that may need further investigation.
doi:10.1136/bmj.e1001
PMCID: PMC3291118  PMID: 22381521
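Two quantities mentioned in the abstract above, a standardised mortality measure and the C statistic, can be sketched as follows. The sketch assumes indirect standardisation (observed deaths divided by the deaths expected from a patient-level risk model); the abstract does not spell out the exact calculation, and the function names and the four-patient example are hypothetical.

```python
def standardised_mortality_ratio(died, predicted_risk):
    """Observed deaths divided by the deaths expected from the risk model."""
    observed = sum(died)
    expected = sum(predicted_risk)
    return observed / expected


def c_statistic(died, predicted_risk):
    """Share of (death, survivor) pairs in which the death had the higher predicted risk.

    Tied predictions count as half a concordant pair.
    """
    death_risks = [p for d, p in zip(died, predicted_risk) if d]
    survivor_risks = [p for d, p in zip(died, predicted_risk) if not d]
    concordant = 0.0
    for risk_death in death_risks:
        for risk_survivor in survivor_risks:
            if risk_death > risk_survivor:
                concordant += 1.0
            elif risk_death == risk_survivor:
                concordant += 0.5
    return concordant / (len(death_risks) * len(survivor_risks))


# Hypothetical hospital with four admissions: 1 = died, 0 = survived.
died = [1, 0, 0, 1]
predicted_risk = [0.5, 0.1, 0.2, 0.2]
print(standardised_mortality_ratio(died, predicted_risk))
print(c_statistic(died, predicted_risk))
```

A ratio above 1 means more deaths than the model expected for that case mix; a C statistic of 0.5 corresponds to chance-level discrimination and 1.0 to perfect discrimination.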
25.  PROGNOSTIC VARIABLES IN SCHIZOPHRENIA* 
Indian Journal of Psychiatry  1989;31(1):51-62.
SUMMARY
This study examined the relationship between measures of outcome and socio-demographic and diagnostic variables in schizophrenia. Product-moment correlation coefficients and stepwise multiple regression were the main statistical techniques of analysis. The results indicate that DSM-III diagnosis of schizophrenia, duration of illness, and the Present State Examination (PSE) syndrome of non-specific psychosis are important predictors of outcome. CATEGO and Research Diagnostic Criteria (RDC) diagnoses of schizophrenia, and Schneiderian first-rank symptoms, were found to be poor predictors of outcome. Socio-demographic and clinical variables such as gender of the patient, place of origin, impersistence at work, poor premorbid work record, hospitalization at the time of admittance into the study, loss of interest, affective flattening and incoherent speech were found to have prognostic implications.
PMCID: PMC2990871  PMID: 21927358
