The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance.
We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings, while both permutation VIMs have equal performance for balanced data settings.
The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response with increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website:
Random forest; Conditional inference trees; Variable importance measure; Feature selection; Unbalanced data; Class imbalance; Area under the curve.
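The approach takes only a few lines of R. Below is a minimal sketch, assuming a recent version of party in which varimpAUC() implements the AUC-based measure; the simulated imbalance level and all variable names are ours, for illustration only.

```r
## Sketch: standard vs. AUC-based permutation VIM under class imbalance.
## Assumes party provides varimpAUC() (recent versions); data simulated.
library(party)

set.seed(1)
n  <- 200
x1 <- rnorm(n)                      # predictor associated with the response
x2 <- rnorm(n)                      # noise predictor
p  <- plogis(-2 + 1.5 * x1)         # intercept -2 induces class imbalance
y  <- factor(rbinom(n, 1, p))
dat <- data.frame(y, x1, x2)

cf <- cforest(y ~ x1 + x2, data = dat,
              controls = cforest_unbiased(ntree = 500, mtry = 1))

varimp(cf)      # standard (error-rate based) permutation VIM
varimpAUC(cf)   # AUC-based permutation VIM, more robust to imbalance
```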
With the date for achieving the targets of the Millennium Development Goals (MDGs) approaching fast, there is heightened concern about equity, as inequities hamper progress towards the MDGs. Equity-focused approaches have the potential to accelerate progress towards the health-related MDGs in a faster, more cost-effective and more sustainable manner than the current pace allows. Ghana's rate of progress towards MDGs 4 and 5, on reducing child and maternal mortality respectively, is less than what is required to achieve the targets. The objective of this paper is to examine the equity dimension of child and maternal health outcomes and interventions, using Ghana as a case study.
Data from the Ghana Demographic and Health Survey 2008 report are analyzed for inequities in selected maternal and child health outcomes and interventions using two population-weighted, regression-based measures: the slope index of inequality and the relative index of inequality (a worked sketch of both indices follows).
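For readers unfamiliar with these indices, the sketch below shows one common way to compute them from quintile-level data; the prevalence figures are invented for illustration and are not values from the Ghana DHS 2008 report.

```r
## Sketch: slope index of inequality (SII) and relative index of
## inequality (RII) from quintile-level prevalence data (illustrative
## numbers, not DHS 2008 values).
pop_share <- rep(0.2, 5)                       # quintiles, poorest first
prev      <- c(0.32, 0.28, 0.25, 0.20, 0.14)   # e.g., stunting prevalence

## Ridit score: midpoint of the cumulative population distribution
ridit <- cumsum(pop_share) - pop_share / 2

fit <- lm(prev ~ ridit, weights = pop_share)
sii <- unname(coef(fit)["ridit"])   # absolute gap, wealthiest vs poorest
rii <- unname(predict(fit, data.frame(ridit = 1)) /
              predict(fit, data.frame(ridit = 0)))  # relative gap
c(SII = sii, RII = rii)   # negative SII / RII < 1: burden on the poorest
```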
No statistically significant inequities are observed in infant and under-five mortality, perinatal mortality, wasting and acute respiratory infection in children. However, stunting, underweight in under-five children, anaemia in children and women, childhood diarrhoea and underweight in women (BMI < 18.5) show inequities to the disadvantage of the poorest: the rates decrease significantly among the wealthiest quintile as compared to the poorest. In contrast, overweight (BMI 25-29.9) and obesity (BMI ≥ 30) among women reveal a different trend - there are inequities in favour of the poorest. In other words, in Ghana overweight and obesity increase significantly among women in the wealthiest quintile compared to the poorest. With respect to interventions: treatment of diarrhoea in children, receipt of all basic vaccines among children and sleeping under ITNs (children and pregnant women) have no wealth-related gradient. Skilled care at birth, delivery in a health facility (both public and private), caesarean section, use of modern contraceptives and intermittent preventive treatment for malaria during pregnancy all show gradients in favour of the wealthiest; the poorest use these interventions less. Not unexpectedly, home delivery is more common among women of the poorest quintile.
Significant inequities are observed in many of the selected child and maternal health outcomes and interventions. Failure to address these inequities vigorously is likely to lead to non-achievement of the MDG targets related to improving child and maternal health (MDGs 4 and 5). The government should therefore give due attention to tackling inequities in health outcomes and in the use of interventions by implementing equity-enhancing measures both within and outside the health sector, in line with the principles of Primary Health Care and the recommendations of the WHO Commission on Social Determinants of Health.
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.
We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
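In R this corresponds to party's cforest with the cforest_unbiased settings, which grow conditional inference trees on subsamples drawn without replacement. A minimal sketch with mixed-type null predictors (all names and data are ours):

```r
## Sketch: unbiased VIMs with predictors of varying type. None of the
## predictors is associated with y, so an unbiased VIM should scatter
## all importances around zero regardless of the number of categories.
library(party)

set.seed(2)
n <- 120
dat <- data.frame(
  y    = factor(rbinom(n, 1, 0.5)),
  cont = rnorm(n),                                        # continuous
  cat2 = factor(sample(letters[1:2], n, replace = TRUE)), # 2 categories
  cat8 = factor(sample(letters[1:8], n, replace = TRUE))  # 8 categories
)

cf <- cforest(y ~ ., data = dat,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(cf)   # subsampling without replacement + unbiased split selection
```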
Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.
RF effectively identifies interactions in low-dimensional data. As the total number of predictor variables increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures capture marginal effects rather than the effects of interactions.
While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
Environmental protection is critical to maintain ecosystem services essential for human well-being. It is important to be able to rank countries by their environmental impact so that poor performers as well as policy ‘models’ can be identified. We provide novel metrics of country-specific environmental impact ranks – one proportional to total resource availability per country and an absolute (total) measure of impact – that explicitly avoid incorporating confounding human health or economic indicators. Our rankings are based on natural forest loss, habitat conversion, marine captures, fertilizer use, water pollution, carbon emissions and species threat, although many other variables were excluded due to a lack of country-specific data. Of 228 countries considered, 179 (proportional) and 171 (absolute) had sufficient data for correlations. The proportional index ranked Singapore, Korea, Qatar, Kuwait, Japan, Thailand, Bahrain, Malaysia, Philippines and Netherlands as having the highest proportional environmental impact, whereas Brazil, USA, China, Indonesia, Japan, Mexico, India, Russia, Australia and Peru had the highest absolute impact (i.e., total resource use, emissions and species threatened). Proportional and absolute environmental impact ranks were correlated, with mainly Asian countries having both high proportional and absolute impact. Despite weak concordance among the drivers of environmental impact, countries often perform poorly for different reasons. We found no evidence to support the environmental Kuznets curve hypothesis of a non-linear relationship between impact and per capita wealth, although there was a weak reduction in environmental impact as per capita wealth increased. Using structural equation models to account for cross-correlation, we found that increasing wealth was the most important driver of environmental impact. Our results show that the global community not only has to encourage better environmental performance in less-developed countries, especially those in Asia, but also needs to focus on the development of environmentally friendly practices in wealthier countries.
We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, the support vector machine (SVM) and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as the stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications, we compare the ranks of previously replicated SNPs in real data and of associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and the disease risk prediction accuracies given by the top-ranked SNPs from the three methods. Software and a webserver are available at http://svmsnps.njit.edu.
Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have been recently proposed for use in high-dimensional data (Monte Carlo logic regression, random forests, and generalized boosted regression). An intuitive way to detect an association between genetic markers and disease status is to use variable importance measures, even though the stability of these measures in the context of a whole-genome association study is unknown. For the simulated data of Problem 3 in the Genetic Analysis Workshop 15 (GAW15), we examined the variability of both rankings and magnitude of variable importance measures using 10 variables simulated to participate in gene × gene and gene × environment interactions. We conducted 500 analyses per method on one randomly selected replicate, tallying the rankings and importance measures for each of the 10 variables of interest. When the simulated effect size was strong, all three methods showed stable rankings and estimates of variable importance. However, under conditions more commonly expected to be encountered in complex diseases, random forests and generalized boosted regression showed more stable estimates of variable importance and variable rankings. Individuals endeavoring to apply statistical learning methods to detect interaction in complex disease studies should perform repeated analyses in order to ensure that variable importance measures and rankings do not vary greatly, even for statistical learning algorithms that are thought to be stable.
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
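A hedged sketch of such a screen, with simulated genotypes and a purely illustrative interaction model, comparing the ranks that RF importance and Fisher's exact test assign to two interacting SNPs:

```r
## Sketch: screening SNPs by RF importance vs. Fisher's exact test.
## The 0/1/2 genotype matrix and the effect structure are simulated.
library(randomForest)

set.seed(3)
n <- 300; p <- 100
geno <- matrix(rbinom(n * p, 2, 0.3), n, p,
               dimnames = list(NULL, paste0("snp", 1:p)))
## Two interacting risk SNPs with weak marginal effects
risk <- geno[, "snp1"] >= 1 & geno[, "snp2"] >= 1
y    <- factor(rbinom(n, 1, ifelse(risk, 0.65, 0.35)))

rf      <- randomForest(geno, y, ntree = 1000, importance = TRUE)
rf_rank <- rank(-importance(rf, type = 1)[, 1])   # permutation importance

fisher_p <- apply(geno, 2, function(g)
  fisher.test(table(g >= 1, y))$p.value)          # univariate screen
fisher_rank <- rank(fisher_p)

## Compare the ranks the two screens give the interacting SNPs
rbind(RF     = rf_rank[c("snp1", "snp2")],
      Fisher = fisher_rank[c("snp1", "snp2")])
```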
One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a certain property with good performance, in a computationally efficient and robust way, since the presence of irrelevant or redundant features can cause poor generalization. In this paper an alternative selection method, based on Random Forests to determine variable importance, is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a ranked list based on the variable importance.
The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. The feature selection step was shown to yield lower prediction errors, with RMSE values 23% lower than without feature selection, while using only 6% of the total number of variables (89 of the original 1485). The proposed approach further compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and, after selection, was validated with an independent set to assess its performance on new data; the results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
The proposed methodology seemingly improves the prediction performance of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors, providing faster and more cost-effective calculation of descriptors by reducing their number, and providing a better understanding of the underlying relationship between the molecular structure represented by descriptors and the property of interest.
Feature selection; Variable importance; High dimensional data; Random forests; Data-mining; Property prediction; QSPR; Hybrid methodology
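A sketch of the two-stage pipeline described above, under the assumption that a plain randomForest importance ranking followed by e1071's svm is an acceptable stand-in for the authors' exact implementation; the descriptor matrix here is a toy placeholder for a real QSPR dataset:

```r
## Sketch: rank descriptors by RF importance, then grow an SVM
## regression model descriptor-by-descriptor in ranked order.
library(randomForest)
library(e1071)

set.seed(4)
X <- matrix(rnorm(100 * 30), 100, 30,
            dimnames = list(NULL, paste0("d", 1:30)))  # toy descriptors
y <- X[, "d1"] - 2 * X[, "d2"] + rnorm(100, sd = 0.5)  # toy property

imp <- importance(randomForest(X, y, ntree = 500, importance = TRUE),
                  type = 1)[, 1]
ranked <- names(sort(imp, decreasing = TRUE))          # importance ranking

cv_rmse <- sapply(seq_along(ranked), function(k) {
  fit <- svm(X[, ranked[1:k], drop = FALSE], y, cross = 10)
  sqrt(mean(fit$MSE))                                  # 10-fold CV RMSE
})
best_k <- which.min(cv_rmse)   # smallest subset attaining the lowest CV error
ranked[1:best_k]
```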
In many biological studies, biomarkers are measured with errors. In addition, study samples are often divided and measured in separate batches, and data collected from different experiments are used in a single analysis. Generally speaking, the structure of the measurement error is unknown and is not easy to ascertain. While the conditions under which the measurements are taken vary from one batch/experiment to another, they are often held steady within each batch/experiment. Thus, the measurement error can be considered batch/experiment specific, that is, fixed within each batch/experiment, which results in a rank-preserving property within each batch/experiment. Under this condition, we study robust statistical methods for analyzing the association between an outcome variable and predictors measured with error, and for evaluating the diagnostic or predictive accuracy of these biomarkers. Our methods require no assumptions on the structure and distribution of the measurement error, which are often unrealistic. Compared to existing methods that are predicated on normality and an additive structure of measurement errors, our methods still yield valid inferences under departure from these assumptions. The proposed methods are easy to implement using off-the-shelf software. Simulation studies show that under various measurement error structures, the performance of the proposed methods is satisfactory even for a fairly small sample size, whereas existing methods based on misspecified structures, as well as a naive approach, exhibit substantial bias. Our methods are illustrated using a biomarker validation case-control study for colorectal neoplasms.
batch effect; batch/experiment specific error; measurement error; ROC analysis; surrogate variable
With a large number of potentially relevant clinical indicators, penalization and ensemble learning methods are thought to provide better predictive performance than usual linear predictors. However, little is known about how they perform in clinical studies where few cases are available. We used Random Forests and Partial Least Squares Discriminant Analysis to select the most salient impairments in Developmental Coordination Disorder (DCD) and to assess patient similarity.
We considered a wide-ranging testing battery for various neuropsychological and visuo-motor impairments which aimed at characterizing subtypes of DCD in a sample of 63 children. Classifiers were optimized on a training sample and used subsequently to rank the 49 items according to a permutation-based measure of variable importance. In addition, subtyping consistency was assessed with cluster analysis on the training sample. Clustering fitness and predictive accuracy were evaluated on the validation sample.
Both classifiers yielded a relevant subset of impairment items that altogether accounted for a sharp discrimination between three DCD subtypes: ideomotor, visual-spatial and constructional, and mixed dyspraxia. The main impairments found to characterize the three subtypes were: digital perception, imitation of gestures, digital praxia, lego blocks, visual-spatial structuration, visual-motor integration, and coordination between upper and lower limbs. Classification accuracy was above 90% for all classifiers, and clustering fitness was found to be satisfactory.
Random Forests and Partial Least Squares Discriminant Analysis are useful tools to extract salient features from a large pool of correlated binary predictors, and they also provide a way to assess individuals' proximities in a reduced factor space. Fewer than 15 neuro-visual, neuro-psychomotor and neuro-psychological tests might be required to provide a sensitive and specific diagnosis of DCD in this particular sample, and isolated markers might be used to refine our understanding of DCD in future studies.
Rank-based tests are alternatives to likelihood-based tests, popularized by their relative robustness and underlying elegant mathematical theory. There has been a surge in research activity in this area in recent years, as a number of researchers are working to develop and extend rank-based procedures to clustered, dependent data, which include situations with known correlation structures (e.g., as in mixed effects models) as well as more general forms of dependence.
The purpose of this paper is to test the symmetry of a marginal distribution under clustered data. However, unlike most other papers in the area, we consider the possibility that the cluster size is a random variable whose distribution is dependent on the distribution of the variable of interest within a cluster. This situation typically arises when the clusters are defined in a natural way (e.g., not controlled by the experimenter or statistician) and in which the size of the cluster may carry information about the distribution of data values within a cluster.
Under the scenario of an informative cluster size, attempts to use some form of variance-adjusted sign or signed rank test would fail, since such tests would not maintain the correct size under the null hypothesis of marginal symmetry. To overcome this difficulty, Datta and Satten (2008; Biometrics, 64, 501–507) proposed a Wilcoxon-type signed rank test based on the principle of within-cluster resampling. In this paper we study this problem in more generality by introducing a class of valid tests employing a general score function. The asymptotic null distribution of these tests is obtained. A simulation study shows that a more general choice of the score function can sometimes result in greater power than the Datta and Satten test; furthermore, this development offers the user a wider choice. We illustrate our tests using a real data example on spinal cord injury patients.
nonparametric tests; clustered data; signed rank test; informative cluster size; symmetry
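The within-cluster resampling principle underlying the Datta–Satten test is easy to sketch, though the proper null distribution of the averaged statistic requires their asymptotic results; the data and cluster structure below are invented:

```r
## Sketch of within-cluster resampling: draw one observation per
## cluster, compute the ordinary one-sample signed rank statistic,
## and average over many draws. Illustrative data only; the null
## distribution of the averaged statistic must come from Datta and
## Satten (2008), not the usual one-sample reference distribution.
set.seed(11)
obs      <- rnorm(500, mean = 0.1)
clusters <- split(obs, sample(1:80, 500, replace = TRUE))

draw_one <- function(v) v[sample.int(length(v), 1)]  # safe for size-1 clusters
wcr_stat <- replicate(2000, {
  one_per_cluster <- vapply(clusters, draw_one, numeric(1))
  unname(wilcox.test(one_per_cluster)$statistic)     # signed rank statistic
})
mean(wcr_stat)   # within-cluster-resampled signed rank statistic
```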
Motivation: We developed an EM-random forest (EMRF) for Haseman–Elston quantitative trait linkage analysis that accounts for marker ambiguity and weighs each sib-pair according to the posterior identical by descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index used to rank markers for variable selection is not optimal when applied to linkage data because of correlation between markers. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
Results: Using simulations, we find that the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman–Elston regression LOD score for various genetic models. Moreover, tree size and the marker subset size evaluated at each node are important considerations in RFs.
Availability: The source code for EMRF written in C is available at www.infornomics.utoronto.ca/downloads/EMRF
Supplementary information: Supplementary data are available at www.infornomics.utoronto.ca/downloads/EMRF
This paper examines an aspect of the problem of measuring inequality in health services. The measures that are commonly applied can be misleading because they obscure the difficulty of obtaining a complete ranking of distributions. The nature of the social welfare function underlying these measures is important. The overall objective is to demonstrate that different inequality measures carry varying implications for the welfare of society.
Various tools for measuring a distribution are applied to illustrative data on four distributions of mental health services. Although these data refer to only one aspect of health, the exercise is of broader relevance than mental health. The summary measures of dispersion conventionally used in empirical work are applied to the data, such as the standard deviation, the coefficient of variation, the relative mean deviation and the Gini coefficient. Other, less commonly used measures are also applied, such as Theil's Index of Entropy and Atkinson's Measure (using two differing assumptions about the inequality aversion parameter). Lorenz curves are also drawn for these distributions.
Distributions are shown to have differing rankings (in terms of which is more equal than another), depending on which measure is applied.
The scope and content of the literature from the past decade about health inequalities and inequities suggest that the economic literature from the past 100 years about inequality and inequity may have been overlooked, generally speaking, in the health inequalities and inequity literature. An understanding of economic theory and economic method, partly introduced in this article, is helpful in analysing health inequality and inequity.
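The measures named above are all short formulas; the sketch below applies textbook definitions to an invented four-group distribution (the relative mean deviation uses one common normalization):

```r
## Sketch: conventional and less common dispersion measures applied to
## an illustrative four-group distribution of service use.
theil <- function(x) { s <- x / mean(x); mean(s * log(s)) }        # Theil T
atkinson <- function(x, eps) {                                     # Atkinson
  if (eps == 1) 1 - exp(mean(log(x))) / mean(x)
  else 1 - mean(x^(1 - eps))^(1 / (1 - eps)) / mean(x)
}
gini <- function(x) {
  x <- sort(x); n <- length(x)
  (2 * sum((1:n) * x)) / (n * sum(x)) - (n + 1) / n
}

x <- c(2, 5, 8, 20)   # illustrative data, not the paper's values
c(sd            = sd(x),
  cv            = sd(x) / mean(x),
  rel_mean_dev  = mean(abs(x - mean(x))) / mean(x),  # one common normalization
  gini          = gini(x),
  theil         = theil(x),
  atkinson_e0.5 = atkinson(x, 0.5),
  atkinson_e2   = atkinson(x, 2))
```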
Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.
When both predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.
Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors may be considered a "bias" - because they do not directly reflect the coefficients in the generating model - or if it is a beneficial attribute of these VIMs is dependent on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
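In the randomForest package the scaled and unscaled variants are one argument apart; a minimal sketch on null data of our own construction:

```r
## Sketch: scaled vs. unscaled permutation VIMs in randomForest.
## The scaled version divides each importance by its standard error,
## which is the variant reported as biased above.
library(randomForest)

set.seed(12)
dat <- data.frame(y  = factor(rbinom(100, 1, 0.5)),
                  x1 = rnorm(100), x2 = rnorm(100))
rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)

importance(rf, type = 1, scale = FALSE)  # unscaled permutation VIM
importance(rf, type = 1, scale = TRUE)   # scaled ("z-score") variant
```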
The Lorenz curve is a graphical tool that is widely used to characterize the concentration of a measure in a population, such as wealth. It is frequently the case that the measure of interest used to rank experimental units when estimating the empirical Lorenz curve, and the corresponding Gini coefficient, is subject to random error. This error can result in an incorrect ranking of experimental units which inevitably leads to a curve that exaggerates the degree of concentration (variation) in the population. We consider a specific data configuration with a hierarchical structure where multiple observations are aggregated within experimental units to form the outcome whose distribution is of interest. Within this context, we explore this bias and discuss several widely available statistical methods that have the potential to reduce or remove the bias in the empirical Lorenz curve. The properties of these methods are examined and compared in a simulation study. This work is motivated by a health outcomes application that seeks to assess the concentration of black patient visits among primary care physicians. The methods are illustrated on data from this study.
concentration; distribution; inequality; hierarchical data
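A sketch of the phenomenon: an empirical Gini computed after ranking on a noisy version of the measure typically overstates concentration. The gamma/normal setup below is illustrative only:

```r
## Sketch: empirical Gini coefficient, true vs. error-contaminated.
gini <- function(x) {
  x <- sort(x); n <- length(x)
  ## Equivalent to 1 - 2 * area under the empirical Lorenz curve
  (2 * sum((1:n) * x)) / (n * sum(x)) - (n + 1) / n
}

set.seed(5)
true_rate  <- rgamma(200, shape = 4, rate = 8)            # true unit measure
noisy_rate <- pmax(true_rate + rnorm(200, sd = 0.15), 0)  # measured w/ error

c(true = gini(true_rate), noisy = gini(noisy_rate))
## The noisy version typically yields the larger Gini: measurement
## error scrambles the ranking and exaggerates apparent concentration.
```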
The construction of the components of Partial Least Squares (PLS) is based on maximizing the covariance/correlation between linear combinations of the predictors and the response. However, the usual Pearson correlation is influenced by outliers in the response or in the predictors. To cope with outliers, we replace the Pearson correlation with the Spearman rank correlation in the optimization criterion of PLS. The rank-based method of PLS is insensitive to outlying values in both the predictors and the response, and incorporates the censoring information using an approach of Nguyen and Rocke (2004) and the two approaches of reweighting and mean imputation of Datta et al. (2007). The performance of the rank-based approaches of PLS, denoted Rank-based Modified Partial Least Squares (RMPLS), Rank-based Reweighted Partial Least Squares (RRWPLS) and Rank-based Mean-Imputation Partial Least Squares (RMIPLS), is investigated in a simulation study and on four real datasets under an Accelerated Failure Time (AFT) model, against their un-ranked counterparts and several other dimension reduction techniques. The results indicate that, in the presence of outliers in the response, RMPLS is a better dimension reduction method than the other variants of PLS and the other methods considered, in terms of the minimized cross-validation error of fit and the mean squared error of fit, and that it is comparable to the other variants of PLS in the absence of outliers.
rank-based PLS; dimension reduction; censored response; outliers
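The criterion change is easy to illustrate for the first component: replace the Pearson weights with Spearman correlations. This sketch is our simplification, not the authors' full RMPLS/RRWPLS/RMIPLS machinery, and it ignores censoring:

```r
## Sketch: first PLS-type component with Spearman in place of Pearson.
set.seed(6)
n <- 80; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)
y[1:3] <- y[1:3] + 10                          # gross outliers in the response

w_pearson  <- cor(X, y)                        # classical weights
w_spearman <- cor(X, y, method = "spearman")   # rank-based weights

t_pearson  <- scale(X) %*% (w_pearson  / sqrt(sum(w_pearson^2)))
t_spearman <- scale(X) %*% (w_spearman / sqrt(sum(w_spearman^2)))
cor(t_pearson, X[, 1]); cor(t_spearman, X[, 1])  # rank version tracks signal
```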
The overall picture of health in sub-Saharan Africa can easily be painted in dark colours. The aim of this viewpoint is to discuss epidemiological data from Tanzania on overall health indicators and the burden of malaria and HIV. Is the situation in Tanzania improving or deteriorating? Are the health-related Millennium Development Goals (MDGs) on reducing under-five mortality, reducing maternal mortality and halting HIV and malaria within reach?
Conclusion: Child mortality and infant mortality rates are decreasing quite dramatically. Malaria prevention strategies and new effective treatment are being launched. The MDG 4 on child mortality is clearly within reach, and the same optimism may apply to MDG 6 on combating malaria.
Childhood mortality; HIV; Infant mortality; Maternal mortality; Malaria; Millennium development goal; Tanzania
Malaria is one of the key targets within Goal 6 of the Millennium Development Goals (MDGs), whereby the disease needs to be halted and reversed by the year 2015. Several other international targets have been set; however, the MDGs are universally accepted and are hence the focus of this manuscript.
An assessment was undertaken to determine the progress South Africa has made against the malaria target of MDG Goal 6. Data were analyzed for the period 2000 to 2010 and verified, accounting for municipal boundary changes in some of South Africa’s districts and after verifying the actual residence of malaria-positive cases.
South Africa has made significant progress in controlling malaria transmission over the past decade: comparing the year 2000 with the year 2010, malaria cases declined by 89.41% (63,663 vs 6,741) and deaths decreased by 85.4% (453 vs 66). Coupled with this, malaria cases among children under five years of age have also declined, by 93% (6,791 in 2000 vs 451 in 2010). As a result, South Africa has achieved and exceeded the malaria target of the MDGs. A series of interventions have contributed to this decrease, including: a drug policy change from monotherapy to artemisinin combination therapy, an insecticide change from pyrethroids back to DDT, cross-border collaboration (South Africa with Mozambique and Swaziland through the Lubombo Spatial Development Initiative, LSDI) and financial investment in malaria control. The KwaZulu-Natal Province has seen the largest reduction in malaria cases and deaths (99.1% for cases, 41,786 vs 380; 98.5% for deaths, 340 vs 5) when comparing the year 2000 with 2010. The Limpopo Province recorded the lowest reduction in malaria cases compared to the other malaria-endemic provinces (a 56.1% reduction, 9,487 vs 4,174, comparing 2000 to 2010).
South Africa is well positioned to move beyond the malaria target of the MDGs and progress towards elimination. However, in addition to its existing interventions, the country will need to sustain its financing for malaria control, support programme reorientation towards elimination, and scale up active surveillance coupled with treatment at the community level. Moreover, cross-border malaria collaboration needs to be sustained and scaled up to prevent the re-introduction of malaria into the country.
Malaria elimination; South Africa; Vector control; Case management and Millennium Development Goals
Information overload is a significant problem for modern medicine. Searching MEDLINE for common topics often retrieves more relevant documents than users can review. Therefore, we must identify documents that are not only relevant, but also important. Our system ranks articles using citation counts and the PageRank algorithm, incorporating data from the Science Citation Index. However, citation data is usually incomplete. Therefore, we explore the relationship between the quantity of citation information available to the system and the quality of the resulting ranking. Specifically, we test the ability of citation count and PageRank to identify “important articles” as defined by experts from large result sets with decreasing citation information. We found that PageRank performs better than simple citation counts, but both algorithms are surprisingly robust to information loss. We conclude that even an incomplete citation database is likely to be effective for importance ranking.
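A sketch of the two rankings on a toy citation graph, using igraph as a stand-in for the authors' system; the edges and the 30% edge-deletion rate are invented:

```r
## Sketch: citation-count vs. PageRank ranking on a toy citation graph.
library(igraph)

## An edge A -> B means "article A cites article B"
edges <- data.frame(from = c("a", "a", "b", "c", "d", "e"),
                    to   = c("b", "c", "c", "d", "c", "c"))
g <- graph_from_data_frame(edges, directed = TRUE)

citation_count <- degree(g, mode = "in")              # simple citation counts
pagerank       <- page_rank(g, damping = 0.85)$vector # PageRank scores

## Simulate incomplete citation data: drop ~30% of edges, re-rank
g_partial <- delete_edges(g, sample(ecount(g), floor(ecount(g) * 0.3)))
rank(-citation_count)
rank(-page_rank(g_partial, damping = 0.85)$vector)
```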
This study aimed to: (a) describe the Strength of Tobacco Control (SoTC) capacity, efforts and resources in rural communities, and (b) examine the relationships between SoTC scores and sociodemographic, political and health-ranking variables.
Data were collected during the baseline pre-intervention phase of a community-based randomized, controlled trial. Rural counties were selected using stratified random sampling (n = 39). Key informant interviews were employed. The SoTC, originally developed and tested with states, was adapted to a county-level measure assessing capacity, efforts, and resources. Univariate analysis and bivariate correlations assessed the SoTC total score and construct scores, as well as their relationships. Multiple regression examined the relationships of county-level sociodemographic, political and health-ranking variables with SoTC total and construct scores.
County population size was positively correlated with capacity (r = 0.44; P < .01), efforts (r = 0.54; P = .01) and SoTC total score (r = 0.51; P < .01). Communities with more resources for tobacco control had better overall county health rankings (r = .43; P < .01). With population size, percent Caucasian, tobacco production, and smoking prevalence as potential predictors of SoTC total score, only population size was significant.
SoTC scores may be useful in determining local tobacco control efforts and appropriate planning for additional public health interventions and resources. Larger rural communities were more likely to have strong tobacco control programs than smaller communities. Smaller rural communities may need to be targeted for training and technical assistance. Leadership development and allocation of resources are needed in all rural communities to address disparities in tobacco use and tobacco control policies.
tobacco control; strength of tobacco control; environmental tobacco smoke pollution; rural communities
Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, we show that random forest has comparable performance to other classification methods, including DLDA, KNN and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
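A simplified backward elimination loop in that spirit (our own loop, not the exact published procedure): drop the least important genes in rounds and monitor the out-of-bag error of the shrinking gene set.

```r
## Sketch: RF-based backward gene elimination on toy expression data.
library(randomForest)

set.seed(7)
X <- matrix(rnorm(60 * 200), 60, 200,
            dimnames = list(NULL, paste0("g", 1:200)))  # toy expression matrix
y <- factor(rbinom(60, 1, plogis(X[, "g1"] + X[, "g2"])))

genes <- colnames(X)
oob_trace <- numeric(0)
while (length(genes) >= 4) {
  rf <- randomForest(X[, genes, drop = FALSE], y, ntree = 500,
                     importance = TRUE)
  oob_trace[as.character(length(genes))] <- rf$err.rate[rf$ntree, "OOB"]
  imp   <- importance(rf, type = 1)[, 1]
  genes <- names(sort(imp, decreasing = TRUE))[1:floor(0.8 * length(genes))]
}
oob_trace   # choose the smallest set whose OOB error is near the minimum
```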
Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.
We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure.
The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.
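With party the conditional scheme is a single argument; a sketch with one pair of correlated predictors (data simulated by us):

```r
## Sketch: marginal vs. conditional permutation importance under
## predictor correlation, using conditional inference forests.
library(party)
library(MASS)

set.seed(8)
n <- 150
X <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2))
dat <- data.frame(y  = X[, 1] + rnorm(n),   # only x1 drives the response
                  x1 = X[, 1], x2 = X[, 2], x3 = rnorm(n))

cf <- cforest(y ~ ., data = dat,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(cf)                       # marginal: x2 inherits importance from x1
varimp(cf, conditional = TRUE)   # conditional: x2 drops toward zero
```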
OBJECTIVE: Researchers, government, and the press often rank jurisdictions according to public health indicators; however, measures of uncertainty rarely accompany these comparisons. To demonstrate the variability associated with rankings that use public health measures, the authors examined the uncertainty associated with ranks based on three common methods used to derive public health indicators: age-adjustment, calculations based on census estimates, and calculations based on survey data. METHODS: The authors observed the effect of changing the standard population from the 1970 population to the 1997 population on rank-order lists of jurisdictions according to age-adjusted 1998 mortality rates. They used a Monte Carlo method to calculate confidence intervals (CIs) around ranks based on census estimates of 1998 infant mortality rates and based on 1999 Behavioral Risk Factor Surveillance System (BRFSS) survey data on the prevalence of hypertension. RESULTS: Changing the standard year from 1970 to 1997 resulted in a shift of at least three rank-order positions for seven states. Two states shifted five positions. CIs associated with ranking by infant mortality rates were broad, with a mean of 16 ranks. CIs around ranks for the prevalence of hypertension were also wide, with a mean of 18 ranks. CONCLUSION: While ranking based on public health indicators is an attractive and popular way of presenting public health data, caution and close examination of the underlying data are needed for proper interpretation. Alternative methods, such as longitudinal analysis or comparisons with standards, may prove more useful.
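The Monte Carlo idea is compact: resample each jurisdiction's rate from a plausible sampling distribution, re-rank on every draw, and take rank quantiles. The counts and populations below are invented, not the paper's data:

```r
## Sketch: Monte Carlo confidence intervals around jurisdiction ranks.
set.seed(9)
deaths <- c(12, 30, 22, 45, 9, 17)            # infant deaths (invented)
births <- c(9000, 21000, 15000, 30000, 7000, 13000)

## Resample counts as Poisson, recompute rates, re-rank on each draw
draws   <- replicate(10000, rank(rpois(length(deaths), deaths) / births))
rank_ci <- apply(draws, 1, quantile, probs = c(0.025, 0.975))
colnames(rank_ci) <- paste0("jurisdiction", seq_along(deaths))
rank_ci   # wide intervals signal that the point ranks are unstable
```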
Ranked gene lists from microarray experiments are usually analysed by assigning significance to predefined gene categories, e.g., based on functional annotations. Tools performing such analyses are often restricted to a category score based on a cutoff in the ranked list and to a significance calculation based on random gene permutations as the null hypothesis.
We analysed three publicly available data sets, in each of which samples were divided into two classes and genes were ranked according to their correlation to the class labels. We developed a program, Catmap (available for download at ), to compare different scores and null hypotheses in gene category analysis, using Gene Ontology annotations for category definitions. When a cutoff-based score was used, results depended strongly on the choice of cutoff, introducing an arbitrariness into the analysis. Comparing results using random gene permutations and random sample permutations, respectively, we found that the assigned significance of a category depended strongly on the choice of null hypothesis. Compared to sample label permutations, gene permutations gave much smaller p-values for large categories with many coexpressed genes.
In gene category analyses of ranked gene lists, a cutoff-independent score is preferable. The choice of null hypothesis is very important; random gene permutations do not work well as an approximation to sample label permutations.
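A sketch of the preferred combination, a cutoff-independent score with a sample-label permutation null, using the mean rank of the category's genes as the score; the data and category membership are simulated by us:

```r
## Sketch: cutoff-free category score + sample-label permutation null.
set.seed(10)
n_genes <- 1000; n_samp <- 20
expr     <- matrix(rnorm(n_genes * n_samp), n_genes, n_samp)
labels   <- rep(0:1, each = n_samp / 2)
category <- 1:25                       # indices of the category's genes

gene_score <- function(lab) apply(expr, 1, function(g) cor(g, lab))

category_stat <- function(lab) {
  r <- rank(-gene_score(lab))          # rank 1 = most class-correlated gene
  mean(r[category])                    # cutoff-free: mean rank of category
}

obs  <- category_stat(labels)
null <- replicate(200, category_stat(sample(labels)))  # label permutations
mean(null <= obs)   # permutation p-value (small mean rank = near top)
```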