Searches for relevant data were conducted within 19 electronic libraries/databases. Articles were also sought using web search engines and within the websites of public health and environmental organisations. A range of activity/health/well-being-related keywords (e.g. exercise, health, restoration, depression) in combination with a range of environment-related keywords (e.g. park, green, outdoors, countryside) was used to search the databases. The bibliographies of included articles were also checked for additional references. Full details of the search strategy are available (see Additional file 1). Full background to the conduct of this systematic review can be found at http://www.environmentalevidence.org/SR40.html
Articles were included in the review if they met the following criteria: collection of data on any measure of health or well-being after direct exposure to a natural environment and after exposure to a synthetic environment. 'Natural environment' was used in a broad sense to include any environment that, based on author descriptions, appeared to be reasonably 'green': this ranged from gardens and parks through to woodland and forests, and also included environments such as university campuses. Synthetic environments included non-green outdoor built environments and indoor environments. 'Direct exposure' could comprise physical presence within the environment (i.e. some form of passive/sedentary activity) or the use of the environment as a setting for a form of physical activity. We did not include studies that only compared pictures, slides or views of natural and synthetic environments. Both observational and experimental studies were included. Excluded from the review were: studies investigating the effects of environmental hazards (e.g. air pollution), studies focusing on hypotheses regarding athlete/exercise performance, and studies that were purely descriptive. Title and abstract inclusion criteria were applied by three reviewers (DB, LBA & TK), with consultation in cases of uncertainty. Full-text inclusion was assessed by two reviewers (DB & TK) for all articles identified as potentially relevant.
From all articles that met the review criteria, basic information was extracted into a standardised spreadsheet, including details of the environment, activity, participants, types of outcomes measured, and the methodology used to collect data. A methodological quality checklist was devised, guided by items from an available quality assessment tool [25]. Six binary criteria were used to summarise study quality: definition of the target population or details provided on participants in the study; random recruitment/third-party referral of participants (as opposed to self-selection); randomisation of participants to environments (or to the order of environments in the case of a crossover trial); base-line data collection to assess pretest comparability; credible data collection tools; and control of potential confounding factors between environmental settings.
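As an illustration only, the six binary criteria could be tallied as a simple checklist score; the field names below are paraphrases of the criteria for this sketch, not the authors' own labels or scoring procedure:

```python
# Hypothetical encoding of the six binary quality criteria described above.
# Names are paraphrased for illustration; the review does not publish a schema.
QUALITY_CRITERIA = [
    "target_population_defined",
    "random_recruitment",            # as opposed to self-selection
    "randomised_to_environments",    # or to environment order (crossover)
    "baseline_data_collected",
    "credible_data_collection_tools",
    "confounders_controlled",
]

def quality_score(study: dict) -> int:
    """Count how many of the six binary criteria a study meets."""
    return sum(bool(study.get(criterion, False)) for criterion in QUALITY_CRITERIA)
```

For example, a study with a defined target population and base-line data, but none of the other criteria, would score 2 out of 6.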
Quantitative synthesis focused on comparisons of the same activity in each environment (natural and synthetic), in order to isolate the specific effect of environmental setting and to ensure consistency in the interpretation of effect sizes from different studies. Four articles that met the review inclusion criteria were excluded from the meta-analysis on this basis [26]. In addition, because the review included studies measuring a broad range of outcomes, a threshold of four studies measuring the same outcome was set for deciding whether to pursue a meta-analysis on a particular outcome.
Numeric data on health/well-being outcomes could usually be extracted from articles in the form of means and standard deviations (or standard errors) as presented in a table or a figure (using TechDig 2.0). If data were not available in the article, an attempt was made to contact the author by email for the relevant data. To ensure consistency in data extraction, the following rules were specified: where individuals had been measured more than once before an activity, the values taken when individuals were still in similar environments [30] were extracted; where individuals had been measured more than once after an activity, the values taken at the time closest to the end of the activity [30] were extracted. This enabled comparison with the remaining studies, as most took measurements shortly after the end of the activity. The standardised mean difference between the outcome after activity in a natural environment and after activity in a synthetic environment was calculated. All effect sizes were calculated using Hedges' g and were corrected with the multiplication factor 1 - 3/(4(n1 + n2) - 9), where n1 and n2 are the sample sizes of groups 1 and 2 respectively, to account for the known bias of this formula as a population estimator. The sign of the effect size was reversed for some outcomes (e.g. anger, anxiety) so that positive values reflect a benefit to health/well-being.
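The effect size calculation above can be sketched as follows (a minimal illustration; function and variable names are our own, and the pooled-standard-deviation form of the denominator is assumed, as is standard for Hedges' g):

```python
import math

def hedges_g(mean_nat, mean_syn, sd_nat, sd_syn, n_nat, n_syn):
    """Standardised mean difference (Hedges' g) between outcomes after
    activity in a natural vs. a synthetic environment, with the
    small-sample bias correction 1 - 3/(4(n1 + n2) - 9)."""
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n_nat - 1) * sd_nat**2 + (n_syn - 1) * sd_syn**2)
                   / (n_nat + n_syn - 2))
    d = (mean_nat - mean_syn) / sp
    j = 1 - 3 / (4 * (n_nat + n_syn) - 9)  # bias-correction factor
    return j * d
```

For a negatively framed outcome such as anger or anxiety, the result would then be multiplied by -1 so that positive values reflect a benefit.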
In most cases, studies also presented data before exposure to each environment. These data were used to calculate a pretest effect size, and we tested the effect of adjusting the posttest effect size by this value to account for any base-line differences. We present the statistics on unadjusted effect sizes and note where the adjustment affects the result. We tested the sensitivity of the effect size to this adjustment, rather than presenting only the adjusted effect sizes, to guard against effect sizes that are due solely to pretest differences; such differences may simply represent a return to "normal" levels in the group that started with higher values, rather than any effect of the environment.
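This sensitivity check can be sketched as a simple difference of standardised effect sizes (an assumption on our part: the text does not give an explicit formula for the adjustment):

```python
def adjusted_effect(g_post, g_pre):
    # Assumed adjustment: subtract the pretest (base-line) effect size from
    # the posttest effect size, removing any base-line imbalance between
    # environments before interpretation.
    return g_post - g_pre

def adjustment_changes_sign(g_post, g_pre):
    # Flag cases where the adjustment flips the direction of the effect,
    # i.e. where an apparent benefit may be driven by pretest differences.
    return (g_post > 0) != (adjusted_effect(g_post, g_pre) > 0)
```

A posttest effect of 0.5 with a pretest effect of 0.8, for instance, would flip sign after adjustment, suggesting the unadjusted result reflects base-line imbalance rather than the environment.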
When data within a study were presented separately for different subgroups, we calculated the effect size for each subgroup and created an average effect size for the study when combining data in the meta-analysis. Similarly, when the same outcome had been measured with more than one test (e.g. different attention tests), we calculated the effect size for each test and used their average. We calculated the overall pooled effect size and its confidence interval as a weighted average of all studies based on a random-effects model. Arguably, fixed-effects models could have been used when the heterogeneity test indicated a non-significant amount of between-study variance ('heterogeneity'); however, in these cases, similar results were obtained either way. We identified statistically significant effects as those where the confidence interval of the pooled effect size did not overlap zero. Heterogeneity was tested using the Q-statistic, calculated as the weighted sum of squared deviations of study effect sizes from the pooled estimate. Studies varied in a number of features (participants, design, environments, etc.), any of which could potentially explain observed heterogeneity. Given the low number of available studies, when heterogeneity was significant (p < 0.05) we limited our investigation of it to comparator environment type (indoor or outdoor built), which represented the main dichotomy. Egger's tests were used to investigate any evidence of publication bias.
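As a sketch of the pooling step, a DerSimonian-Laird random-effects estimator with the Q-statistic could look like the following (the text does not name its specific estimator, so DerSimonian-Laird is an assumption; it is the most common choice):

```python
import math

def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooled effect size, 95% CI,
    Q-statistic, and between-study variance tau^2 (a sketch)."""
    w = [1 / v for v in variances]            # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sw
    # Q: weighted sum of squared deviations from the fixed-effect mean
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    k = len(effects)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)        # between-study variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * g for wi, g in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    return pooled, ci, q, tau2
```

An effect would then be called statistically significant when the returned confidence interval does not overlap zero, mirroring the criterion above; when tau^2 is estimated as zero, the pooled result coincides with the fixed-effects estimate, consistent with the observation that both models gave similar results here.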