We have identified the following nine challenges: (1) heterogeneous and complex interventions; (2) paucity of trial data; (3) selecting outcomes of interest; (4) using indirect evidence; (5) integrating values and preferences; (6) considering resource use; (7) addressing social and legal barriers; (8) wording of recommendations; and (9) developing global guidelines. We discuss each of these challenges below, along with how we addressed them, and summarize them in Table 1. The order of presentation is not related to the seriousness of the challenges. The first four challenges relate to the grading of the quality of evidence while the other five relate to the grading of the strength of recommendation.
| Table 1. Summary of the challenges and applied solutions |
Heterogeneous and complex interventions
The nature of public health interventions differs from that of most clinical interventions in a number of respects. While clinical interventions can be heterogeneous, complex and used in combination, public health interventions tend to possess these characteristics more frequently and to a larger degree [19].
We frequently identified heterogeneity in the characteristics of interventions evaluated in studies. For example, studies examining group behavioral interventions varied in the number of participants per group, in the number of sessions, and in the topics discussed (e.g., sexual risk behavior, safer sex, condom use, healthy relationships, etc.).
Our challenge was to decide whether these interventions were similar enough to consider (and meta-analyze) these studies together, and ultimately whether to issue one or multiple recommendations. In the case of behavioral interventions, we classified them a priori based on whether they were implemented at the individual, group or community level. We faced similar difficulties with the reviews of sex venue-based interventions, social marketing campaigns, and Internet-based targeted interventions.
Many public health interventions are implemented in combination with other interventions because of their assumed synergistic effect. For example, counseling interventions are implemented along with HIV testing to improve understanding and behavior change. In the field of HIV, this has been recently discussed as combination HIV prevention [20, 21]. Furthermore, the specific combination of interventions depends on the setting and local needs (the “know your epidemic, know your current response” approach) [22].
For global guidelines, it is not practical to issue recommendations for all possible combinations of interventions, and it is very unlikely that studies addressing all different combinations exist. Our approach was to consider each individual intervention separately, to the extent that the available evidence allowed this. The aim was to provide a menu of effective individual interventions for policymakers to choose from according to their local conditions and needs. The main limitations of this approach are the potential to miss synergistic effects and the difficulty of dealing with evidence that assesses combinations of interventions.
Another methodological challenge associated with the evaluation of these interventions is the risk of contamination (e.g., with information diffusion interventions) and the need for cluster randomized trials [23]. Also, the success of some of these interventions depends on the individuals implementing them (e.g., behavioral change interventions), requiring careful evaluation methods [24].
Paucity of trial data
The scarcity of trial data and the use of non-trial data were matters of debate during our guideline development process, as they have been for others involved in guideline development [25].
The evidence for all but four questions came from observational studies, resulting, according to the GRADE approach (Additional file 1) [13], in ‘very low’ or ‘low’ quality of evidence for 12 out of 15 graded recommendations. Also, according to the GRADE approach, recommendations based on lower quality of evidence are likely to be conditional, as opposed to strong (Additional file 2) [12]. Of the 15 graded recommendations, 10 were graded as conditional.
Some panelists at the guidelines consensus meeting were uncomfortable with the relatively low rating of the quality of evidence. They argued that conducting randomized controlled trials (RCTs) for public health questions might be either challenging (e.g., community level implementation of a behavioral intervention), or impossible (e.g., legal interventions). Therefore, they felt that the available evidence from non-randomized studies should be rated higher [26-28].
However, the fact that the best available evidence comes (or could only come) from observational study designs does not automatically imply that these designs provide high quality evidence. Indeed, the reasons for which we have less confidence in these designs relative to RCTs are valid irrespective of whether trial data are (or could be) available. In addition, observational studies could potentially provide “moderate” and even “high” quality evidence within the GRADE framework [15]. For example, the quality of evidence from cohort studies for the effectiveness of condom use among MSM was rated up to “moderate” due to a large effect size (relative risk of 0.34).
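The rating logic described above can be illustrated with a minimal Python sketch. It is not the panel's procedure. It assumes the commonly cited GRADE thresholds for rating up (one level for a relative risk below 0.5 or above 2, two levels for a relative risk below 0.2 or above 5), applies rating up only to observational evidence with no downgrades, and uses a hypothetical function name, `grade_quality`.

```python
# Quality levels from lowest to highest, as used in GRADE.
LEVELS = ["very low", "low", "moderate", "high"]


def grade_quality(design, downgrades=0, relative_risk=None):
    """Return a GRADE-style quality level (simplified illustration).

    design: "rct" (starts at high) or "observational" (starts at low).
    downgrades: total levels rated down (e.g., for indirectness).
    relative_risk: optional effect estimate used for rating up.
    """
    start = {"rct": 3, "observational": 1}[design]
    upgrades = 0
    # Assumption: rate up for a large effect only when the evidence is
    # observational and has not been rated down for any reason.
    if design == "observational" and downgrades == 0 and relative_risk:
        if relative_risk <= 0:
            raise ValueError("relative_risk must be positive")
        ratio = max(relative_risk, 1 / relative_risk)
        if ratio > 5:
            upgrades = 2   # very large effect
        elif ratio > 2:
            upgrades = 1   # large effect
    index = min(max(start - downgrades + upgrades, 0), len(LEVELS) - 1)
    return LEVELS[index]


# Cohort studies of condom use among MSM, relative risk 0.34 (from the text).
print(grade_quality("observational", relative_risk=0.34))
```

Under these assumptions, a relative risk of 0.34 counts as a large effect (1/0.34 is about 2.9), so observational evidence moves from “low” to “moderate”, in line with the panel's rating.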
Panelists may have been uncomfortable with the low rating of the quality of evidence because it would likely lead to a weak recommendation. There was a concern that policymakers may use both “low quality evidence” (i.e., low confidence in effect estimates) and “weak recommendations” as excuses to forgo the implementation of the recommendation. Please refer to challenge (8) on how we used wording of recommendations to address these concerns.
Although GRADE stipulates that trial data provide the least-biased evidence for the effects of interventions, it does consider other types of studies. Firstly, it considers observational studies for assessing (1) the long-term and rare effects of interventions, and (2) effectiveness in the absence of trial data, which, as discussed above, may lead to moderate or high quality evidence (Additional file 1). It is possible that observational data could provide higher quality evidence than poorly conducted trials.
Secondly, in order to generate accurate absolute estimates of effects, GRADE calls for deriving baseline risks of outcomes from observational studies. For example, in assessing the absolute effects of behavioral interventions on HIV incidence, we derived the baseline risk from cross-sectional studies appropriate for the low-, intermediate-, and high-risk groups [29, 30].
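As a rough illustration of how a baseline risk and a relative effect combine into an absolute effect, the following Python sketch computes the risk difference per 1000 people for three risk groups. All numbers are hypothetical placeholders chosen for readability. They are not the baseline risks or effect estimates used in the guideline, and the function name is an assumption.

```python
def absolute_effect_per_1000(baseline_risk, relative_risk):
    """Risk difference per 1000 people (negative means fewer events).

    baseline_risk: risk of the outcome without the intervention (0 to 1).
    relative_risk: relative effect of the intervention on that outcome.
    """
    risk_with_intervention = baseline_risk * relative_risk
    return round((risk_with_intervention - baseline_risk) * 1000, 1)


# Hypothetical baseline risks of HIV infection for three risk groups.
HYPOTHETICAL_BASELINE_RISKS = {"low": 0.01, "intermediate": 0.05, "high": 0.10}
# Hypothetical relative risk for a behavioral intervention.
HYPOTHETICAL_RELATIVE_RISK = 0.80

for group, risk in HYPOTHETICAL_BASELINE_RISKS.items():
    effect = absolute_effect_per_1000(risk, HYPOTHETICAL_RELATIVE_RISK)
    print(f"{group}-risk group: {effect} events per 1000")
```

The same relative effect translates into a larger absolute effect in higher-risk groups, which is why the choice of baseline risk matters when presenting absolute estimates.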
Thirdly, assessing public health interventions benefits from a broad range of evidence, e.g., from the behavioral and social sciences [25]. One example is the use of process evaluation of complex interventions [31]. By capturing the experiences of participants and the details of the implementation, qualitative and quantitative studies help evaluate the different components of the intervention and investigate its contextual factors (e.g., socio-cultural factors which can act as mediators and moderators). For studies with positive results, observational data aid in assessing transferability [32]. Studies finding no effects can help in distinguishing between interventions that are inherently ineffective and those which, because of poor implementation, were not fairly evaluated [32].
Identifying process evaluation data requires additional time and resources, a specific set of skills and expertise, and searching outside the conventional biomedical databases. Moreover, this type of evidence is not readily available [33]. Indeed, we did not identify any process evaluation study for any of the RCTs considered in this guideline. This proved particularly problematic in interpreting the results of a group level behavioral intervention that unexpectedly showed increased rates of STI [34]. Subsequently, the panel was not able to reach a consensus on whether to recommend group level interventions on the basis of this one study.
Selecting the outcomes of interest
Selecting important outcomes for each PICO question is a critical step in the development of recommendations [17], especially for public health guidelines.
The important outcomes should include all benefits and harms of the intervention. One additional important outcome that the panel considered was quality of life. In fact, it is thought that MSM are using condoms less frequently because condom use negatively affects their sexual experience. Another important unintended effect that the panel considered is discrimination [35-37].
The choice of important outcomes should be independent of whether or not they have been empirically assessed, while the choice of intermediate outcomes should capture those that have been empirically assessed. These outcomes should be selected in a transparent and comprehensive manner, and a priori (i.e., prior to reviewing the evidence). In order to achieve these goals, we developed an outcome framework for each question or group of related questions (Figures) [38]. Each framework describes all possible pathways between the intervention and the important outcomes.
Public health interventions can affect both individual-level and community-level outcomes. For example, an HIV testing and counseling intervention may affect the timeliness of diagnosis and treatment at the individual level, and HIV transmission at the community level (Figure). In both cases, the testing intervention would eventually affect both morbidity and mortality. The outcome framework helped in depicting these two distinct but convergent pathways.
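One way to picture such an outcome framework is as a small directed graph whose paths run from the intervention to the outcomes of utmost importance. The Python sketch below encodes only the testing and counseling example described in the text. The node labels and the `pathways` helper are illustrative assumptions and do not reproduce the published figures.

```python
# Directed graph: each node maps to the outcomes it is assumed to influence.
TESTING_FRAMEWORK = {
    "HIV testing and counseling": [
        "timely diagnosis and treatment (individual level)",
        "HIV transmission (community level)",
    ],
    "timely diagnosis and treatment (individual level)": ["morbidity and mortality"],
    "HIV transmission (community level)": ["morbidity and mortality"],
    "morbidity and mortality": [],
}


def pathways(graph, start, goal):
    """Return every path from start to goal as a list of node labels."""
    if start == goal:
        return [[start]]
    found = []
    for next_node in graph.get(start, []):
        for tail in pathways(graph, next_node, goal):
            found.append([start] + tail)
    return found


for path in pathways(TESTING_FRAMEWORK, "HIV testing and counseling",
                     "morbidity and mortality"):
    print(" -> ".join(path))
```

Enumerating the paths makes the two distinct but convergent routes explicit: one through individual-level diagnosis and treatment, the other through community-level transmission.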
The pathway for a prevention intervention has three levels of outcomes: (1) behavioral change, (2) HIV acquisition and transmission, and (3) morbidity and mortality (Figure). In this pathway the outcomes of utmost importance are morbidity and mortality. A conservative approach might consider these outcomes as the only important ones. Practically, this would lead to downgrading the quality of evidence associated with HIV acquisition and transmission for indirectness, leading to moderate quality evidence at best. The guideline panel made the judgment that HIV acquisition and transmission are relevant enough outcomes that downgrading the quality is not warranted. This judgment was based on the fact that a reduction in HIV transmission is highly associated with a reduction in morbidity and mortality associated with HIV infection. On the other hand, the panel made the judgment that behavioral change is an indirect outcome warranting downgrading the quality of evidence.
Indirectness of the evidence
In addition to the indirectness of the outcome (see preceding paragraph), the panel dealt with the indirectness of the population and of the setting, a concept also known as applicability [39]. Although these guidelines address both MSM and transgender people, we did not identify direct evidence for transgender people. The panel made the judgment that the indirectness of the evidence for that population was not serious enough to warrant downgrading its quality. However, one has to acknowledge that the degree of indirectness varies across the questions. For example, the evidence is likely to be more indirect for the behavioral interventions relative to screening interventions.
As to the setting, for many PICO questions, the available evidence came solely from high-income countries. The judgment of the degree of indirectness depended on the intervention. For example, the panel judged that the evidence about the effect of condom use on HIV infection is not indirect enough when applied to low-income countries to warrant downgrading the quality of evidence. Conversely, the panel judged that the evidence about the effects of serosorting on HIV infection is indirect enough to warrant downgrading the quality of evidence, because the practice of serosorting requires regular high-quality and easily accessible HIV testing, re-testing and counseling that is frequently not available in low- and middle-income countries. Guideline panels may use more formal approaches to judging applicability [40].
Integrating values and preferences
In the GRADE framework, the values and preferences of the target population are a major factor in determining the direction and strength of recommendations [14]. For these guidelines, the perspectives of MSM and transgender people were incorporated in a variety of ways. Firstly, community representatives were members of the core working group and the final meeting consensus panel. Secondly, two community members, one from a low-income country and one from a high-income country, reviewed the final drafts of the guidelines. Thirdly, the WHO secretariat commissioned the Global Forum on MSM and HIV (GFMSM) to conduct a survey of both HIV positive and HIV negative MSM and transgender people from Asia, Africa and Latin America. The survey consisted of online interviews about the values and preferences community members attach to the outcomes and interventions considered in the guideline questions, the implications of the proposed guidelines, and concerns that might emerge among them from their potential implementation. The results of the survey were an integral part of the decision tables used, in some instances verbatim, at the guideline consensus meeting. The working group discussed the possibility of conducting a systematic review of studies of values and preferences relevant to the guideline questions, but did not pursue this because of time and resource limitations.
The above approach of integrating values and preferences of the target population has resulted in the guidelines taking the perspective of MSM and transgender people. The advantages of that approach include improving the quality of the guidelines, and increasing their chances of acceptance and implementation by the MSM and transgender communities. One disadvantage is a potential inconsistency of some of the recommendations with differing sets of values and preferences in certain settings.
Considering resource use
Resource use is one factor that is considered in determining the direction and strength of recommendations in the GRADE framework [14]. Resource use becomes particularly important for guidelines targeting low- and middle-income countries. Indeed, the availability of resources and costs are likely to vary substantially across low- and middle-income countries. Unfortunately, the expertise and resources were not available for the panel to consider this factor in a systematic and formal way. However, the decision tables for each recommendation included a judgment about the implications of resource use. For example, condom use was seen as an intervention that is not particularly resource-intensive. On the other hand, medical male circumcision was seen as a resource-intensive intervention, particularly in settings in which such programs are not being rolled out for the general male population. This affected the direction of the recommendation and resulted in a conditional recommendation against male circumcision in spite of low quality evidence suggesting benefits may outweigh harms.
Addressing social and legal barriers
Discrimination, stigma, punitive laws and law enforcement practices are major barriers for MSM and transgender people in accessing health services [41-43]. This undermines the effectiveness of HIV prevention and treatment programs, particularly in low- and middle-income countries [44]. The panel felt a need to include a strong recommendation to make health services inclusive of MSM and transgender people, and more generally to ensure protective laws and regulations. However, there was no direct evidence to justify a strong recommendation, if one followed the typical GRADE approach. The resolution was to frame these recommendations as ‘good practice recommendations’, as defined by GRADE, and base them on the principles of medical ethics and human rights [7].
Good practice recommendations are typically those in which desirable effects undoubtedly outweigh any undesirable ones so that conducting a study addressing the implicit question could not be justified [7]. Indeed, a test of whether a recommendation qualifies as a ‘good practice recommendation’ is to check whether the alternative sounds bizarre or ridiculous. GRADE suggests using ‘good practice recommendations’ for interventions that represent “necessary and standard procedures of the clinical encounter or health care system” [7].
Here are the two good practice recommendations included in the guidelines:
We recommend making health services inclusive of men who have sex with men and transgender people, based on the principles of medical ethics and the right to health.
We recommend that legislators and other government authorities establish anti-discrimination and other protective laws, derived from international human rights standards, in order inter alia to eliminate discrimination and violence faced by men who have sex with men and transgender people, and reduce their vulnerability to infection with HIV and the impacts of HIV and AIDS.
Wording of recommendations
Guideline panels present their judgments about the quality of evidence and strength of recommendations using specific wordings of the recommendation statements and affixing grades to these statements (using a combination of letters, numbers, and symbols) [45]. Unfortunately, there is little evidence of how well various presentations are understood [45-47].
GRADE initially suggested using the words ‘strong’ and ‘weak’ to characterize the strength of recommendations [12]. It has become clear with experience that for certain panelists ‘weak’ is not acceptable wording. Specifically, public health guideline panels worry that policymakers use this wording as an excuse to forgo the adoption of the intervention being recommended. This is in spite of the fact that a weak recommendation is intended to invite policymakers to involve their stakeholders in substantial debate in considering the intervention (as opposed to a strong recommendation intended to invite policymakers to adopt the intervention as a policy) [12]. Thus, the GRADE working group suggested alternatives to ‘weak’ such as ‘conditional’, ‘contingent’, and ‘qualified’. The core guideline working group adopted the term ‘conditional’ and the panelists were very receptive.
The potential advantage of the term ‘conditional’ is that it invites the user of the recommendation to consider the implications of that term. ‘Conditional’ can have four possible implications (or combinations of these), depending on which factor(s) affected the strength and/or direction of the recommendation (Additional file 3). Table 2 provides the implications and examples for each of these four factors.
| Table 2. Implications of a conditional recommendation according to the factors that affected the strength and/or direction of the recommendation |
The panel was also deliberately sensitive in wording the statement of the recommendation. While typically a recommendation would refer to the ‘use’ of a specific intervention, the guideline recommendations refer to the “offering” of an intervention. The intention was to avoid the perception of coerciveness especially in settings where MSM and transgender people are particularly marginalized and at risk of stigma, violence and abuse.
Developing global guidelines
Developing guidelines with global scope is challenging, whether the topic is of clinical or public health nature. This may be particularly true for public health guidelines to be adopted by different countries or by different jurisdictions within a country. Indeed, the importance of the problem (and consequently the size of the effects), and the implications of an intervention (in terms of the availability of resources, costs, cost-effectiveness, acceptability, and feasibility) often vary substantially across settings.
In order to address this challenge, the survey of values and preferences recruited participants globally. We prioritized evidence of effectiveness and of incidence of outcomes from low- and middle-income countries when available. Also, the panelists were asked to prioritize the perspective of low- and middle-income countries when making judgments about values and preferences, feasibility and resource use.