Conceived and designed the experiments: PO MV BA MG FM AM OL PB JA. Wrote the paper: PO MV BA MG FM AM OL PB JA.
The current evidence-base for recommendations on the treatment of cutaneous leishmaniasis (CL) is generally weak. Systematic reviews have pointed to a general lack of standardization of methods for the conduct and analysis of clinical trials of CL, compounded with poor overall quality of several trials. For CL, there is a specific need for methodologies which can be applied generally, while allowing the flexibility needed to cover the diverse forms of the disease. This paper intends to provide clinical investigators with guidance for the design, conduct, analysis and report of clinical trials of treatments for CL, including the definition of measurable, reproducible and clinically-meaningful outcomes. Having unified criteria will help strengthen evidence, optimize investments, and enhance the capacity for high-quality trials. The limited resources available for CL have to be concentrated in clinical studies of excellence that meet international quality standards.
Solid evidence is needed to decide how to treat conditions. In the case of cutaneous leishmaniasis, the diversity of clinical conditions, combined with the heterogeneity and weaknesses of the methodologies used in clinical trials, makes it difficult to derive robust conclusions as to which treatments should be used. There are also other imperatives: ethical (not exposing patients to treatments that cannot be assessed adequately) and financial (optimizing the use of limited resources for a neglected condition). This paper is meant to provide clinical investigators with guidance for the design, conduct, analysis and reporting of clinical trials to assess the efficacy and safety of treatments of this condition.
It is important to harmonize and improve clinical trial methodology for cutaneous leishmaniasis (CL); currently, treatment options are few and the quality of the supporting evidence is generally inadequate, which weakens the recommendations that can be made for the treatment of this disease.
To improve the case management and control of CL, better treatment modalities with reliable evidence of efficacy, safety, tolerability and effectiveness are required. High-quality clinical trials are essential to determine which therapeutic interventions can confidently be recommended for treating which form of CL. Today, this is unfortunately not the case in numerous instances.
The inadequacies of trials of different treatments of CL have been documented by two WHO-supported Cochrane systematic reviews, which included 97 randomized controlled trials on treatments for Old World and American CL. They revealed critical issues related to the methodological quality of the design and reporting of these clinical trials, which make it difficult to compare results, meta-analyse the studies, and draw generalizable conclusions. Weaknesses ranged from inadequate study design (including appropriate controls, endpoints, outcome measures, follow-up times) and execution (randomization, allocation concealment, blinding) to analyses and reporting (e.g. use of disparate endpoints). They also found a large number of trials that did not meet basic criteria, and could not be included in the analyses.
This makes a compelling case for defining and harmonizing elements related to the design, conduct, analysis, clinical relevance, and reporting of trials, and ultimately the acceptance of studies by regulatory agencies. Improving the quality of studies and harmonizing protocols will make meta-analysis more informative and thus strengthen evidence for recommendations on treatment and case management. Furthermore, conducting inadequate trials, which may lead to inappropriate conclusions, is both unethical and an inefficient use of the limited resources available for research into this neglected disease.
As heterogeneity is an inherent feature of CL (reflecting the variety of species and manifestations), there are obvious challenges in designing and interpreting trials to assess interventions for CL which will allow deriving generalizable results and recommendations.
The objective of this paper is twofold:
This paper focusses on CL trial-specific issues; it only touches upon more general aspects of clinical trial conduct, which are extensively addressed in a number of relevant papers and documents. For instance the Global Health Trials website  offers several resources including a trial protocol tool .
The collective name of CL comprises several manifestations caused by different Leishmania species in the Old and the New World (OWCL and NWCL) and clinical trial methodology should be adapted to this spectrum of conditions. CL is caused by organisms of the L. mexicana complex and Viannia sub-genus (L. braziliensis and L. guyanensis complex) in the New World and L. major, L. tropica and L. aethiopica in the Old World. L. infantum in both Worlds and L. donovani in the Old World can also cause CL. The wide spectrum of clinical manifestations, natural histories and responses to treatment observed in CL patients is accounted for by the combination of parasite's intrinsic differences and patient's genetic diversity.
The time required for natural cure (“self-healing”) is poorly defined and varies widely; it is generally accepted that lesions caused by L. mexicana in the New World and L. major in the Old World heal spontaneously in a time varying from a few weeks to several months in the majority of patients – except new foci (where the disease tends to be aggressive and self-healing is uncommon), and as opposed to other species (where spontaneous healing barely occurs or requires years). Bacterial super-infections are also frequent and can interfere with healing.
The natural history of the disease must be accounted for when designing a clinical trial. Good knowledge of the disease characteristics at the trial site is essential; it is not possible to extract generalizable data from the published literature. For instance, when considering the placebo arms of randomised controlled trials (RCTs) from the Cochrane systematic review of OWCL, 3-month cure rates for L. major were 21% in Saudi Arabia and 53% in Iran with oral placebo. With a topical placebo, they varied from 13% to 63% at 2 months in Iran and were 61% in Tunisia at 2.5 months. For L. tropica, cure rates were 0%–10% with oral placebo. In the New World, the information is scarce and more variable, ranging from 0% cure rate at one month in Panama to 37% at 12 months in Colombia for lesions most probably caused by L. panamensis. In Guatemala, using topical or oral placebos a 68% cure rate was reported at 3 months for lesions due to L. mexicana and only 2% for lesions due to L. braziliensis, while other studies have reported cure rates of 27% and 39% in the general population at 3 and 12 months respectively. In Ecuador, in a small group of 15 patients, a cure rate of 75% at 1.5 months (no speciation but likely L. panamensis) was reported without any treatment.
The examples above illustrate the need to acquire and factor in local data on the natural history of disease in order to assess more accurately treatment performance.
A wide variety of treatment modalities has been reported for CL, but none has been shown to be universally effective. Treatment response varies according to a range of factors, including the Leishmania species, the patient immune status and age, the number and localization of the lesions, the severity of the disease, the treatment given and the route of administration, etc.
Treatment would not only benefit the individual patient but also reduce the burden of human reservoirs in the case of anthroponotic CL, and prevent super-infection and the resulting complications. The choice of treatment, either local or systemic, is usually based on the size, number and localization of lesions, lymphatic spread or dissemination, patient's immune status, cost, risk-benefit and the availability of the treatment itself in the country. Currently available treatment options (systemic and topical) can be found in the WHO 2010 technical report.
The characteristics of the participants to be included must be adapted to the specific purpose of each clinical trial and must be representative of the typical patients seen in practice. The relevance of the spectrum composition of the study population to the range of patients seen in practice is of paramount importance, especially in phase 3 and 4 trials. The factors that allow or disallow someone to participate in a clinical trial (“inclusion” and “exclusion” criteria, respectively) are used to identify appropriate participants and ensure both their safety and sound conclusions of the study. Establishing common grounds for entry criteria is also important in order to harmonize study populations across trials and facilitate comparability of trials and meta-analyses. It is also important to indicate the catchment characteristics in terms of area and population, which would help in deciding as to the applicability of the findings of a trial, and ensure that the enrolled patients are representative of the larger patient population in that site (“spectrum composition”).
One or more of the following criteria may apply.
Table 1 illustrates how to apply the different entry criteria based on the type of study (Phase 2–4) and treatment being tested (systemic or topical). The final decision, however, must be taken based on prior knowledge accrued during the pre-clinical and phase I studies.
The protocol must identify clearly primary and secondary endpoints for efficacy and safety. The primary efficacy endpoint must be both accurate and robust; the protocol should clarify how and when cure is defined. It is advisable to focus the research on few endpoints that are feasible and attainable within the study, and avoid multiple, diffuse endpoints. Harmonizing efficacy endpoints is essential to allow comparing study results and conducting meta-analyses.
Any procedures applied which may interfere with healing should be standardised upfront and reported in sufficient detail. This applies to the standard of care, including dressing, debridement and cleaning of ulcers before and during treatment.
It is generally agreed that cure should be defined based on clinical parameters. Early studies showed that parasitological examination at the end of therapy correlates poorly with the final treatment outcome  and relatively few studies have since based their definition of cure on a parasitological outcome. However, it must be pointed out that, for licensure studies, this point may have to be discussed beforehand with regulatory authorities (which will have no particular knowledge of the disease, and apply traditional clinical microbiology criteria).
Ideally, a clinically accurate definition would include a combination of five parameters (Figure 2): (i) area of ulceration, when present (x by y), (ii) area of induration (x′ by y′), (iii) thickness of induration (z), (iv) colour of infiltrated border, and (v) degree of scarring as a proxy for patient's quality of life.
However, colour and thickness are prone to inter-observer variations and difficult to measure, and quality of life is highly subjective. There is, however, increasing general attention on patient-reported outcomes (PRO) being used as study endpoints. Research into properly constructed PROs should be encouraged. Specifically for CL, this would apply in particular to cosmetic endpoints, like scar assessment methods.
Ulceration area (after debridement and cleaning) is the easiest parameter to measure, and is also clinically meaningful. There is recent evidence that ulceration and induration have parallel evolutions, so both accurately reflect lesion evolution (Buffet, Ben Salah, Grögl et al. Unpublished data).
For species causing predominantly nodular lesions (L. infantum, L. mexicana, L. aethiopica), induration area should be used to measure treatment effects. Measuring induration is more difficult than measuring ulcers, and requires training of the study team on e.g. the ball-pen technique, to ensure inter-observer reproducibility. Only areas of “red” or “inflamed” induration should be considered, while hypertrophic scars (where induration no longer reflects an active lesion but rather an aberrant scarring evolution) should be discounted.
Induration should also be used to capture relapses manifesting as purely nodular lesions (i.e., no ulceration). This is a rare situation where parasitological examination should be performed in order to ascribe the new lesion to the parasite. Satellite lesions occur in 5–8% of the CL caused by L. tropica. These classical lesions do not always contain abundant parasites and may not require parasitological examination, which is invasive.
The use of a single time point at which cure rates are compared between arms is simple and practical, but not fully informative. Time to cure is also important for self-healing CL. Actuarial analysis of multiple sequential observations (e.g. product-limit estimate of time-to-cure using Kaplan-Meier analysis) is also possible, though more cumbersome, and care must be exercised not to over-estimate clinically non-relevant differences (see the section on survival analysis below).
Clinical trials conducted between the late 1980s and early 2000s showed that tissue repair may take several weeks after the causal factor has been removed (i.e., parasites have been killed). Empirically, 6–9 weeks after treatment start is a reasonable compromise: it leaves enough time for most lesions to heal, yet it is not too long for a patient receiving placebo or an ineffective treatment to receive rescue treatment.
In order to both harmonise and simplify procedures, treatment outcome should be assessed on three occasions (counting from the first day of treatment):
For a unified, standardised efficacy reporting, a simple, dichotomous outcome definition as either “cure” or “failure” should be adopted, whereby “cure” can only be declared at the end of follow-up (Day 180–360), whereas “failure” can occur at any time (and will require rescue treatment).
Figure 3 describes the decision-tree.
“Cure” is defined as:
“Failure” is defined as:
Depending on the natural history of the disease or its local epidemiological characteristics, additional, secondary parameters may be used to qualify cure, such as the absence of induration, redness or papules around the lesion, or, in case of papules and nodules, parasitological positivity, though only after due consideration of its significance.
The assessment and reporting of the safety, toxicity and tolerability of treatments, while an essential component of the evaluation, is often overlooked in CL clinical trials. Topical treatments may produce local events at the site of the lesion (like irritation); systemic treatments may cause generalised signs or symptoms, including changes in laboratory values.
Events should be reported and graded using standard nomenclature and criteria of severity. Whenever possible, events must be combined under a syndrome or diagnosis.
It is important to comply with regulations for filing serious events; specific requirements exist for timely reporting according to national regulations (health authorities, regulatory authorities, ethics committees). However, investigators must be alerted to the fact that definitions and rules for reporting may evolve with time and are not fully harmonised between countries.
This section deals with study design, with a specific focus on issues of special relevance to comparing treatments for CL. In this context, we delve more into types of design (such as non-inferiority trials, adaptive designs) that the typical CL investigator might be less familiar with.
According to the recent WHO treatment recommendations for leishmaniasis, including CL, there are cases (e.g. uncomplicated L. major) where an unfavourable risk-benefit ratio (resulting from the combination of a self-curing lesion and the lack of an effective and safe treatment) means that no treatment may currently be recommended (and thus no standard treatment exists to which to compare) . In other cases, cure rates up to or above 90% have been reported following different treatments, though results depend also on the duration of follow-up , . However, even when efficacy is high, the risk-benefit of some such treatments is not always well-established, or in favour of the intervention (e.g. systemic toxicity associated with the use of parenteral antimony).
These elements must be accounted for when designing a clinical trial for any specific form of CL. These trials will belong to one of the following types: Phase 2 (safety and dose-finding studies to select the dose and duration of treatment which is safe and effective to be tested further in larger efficacy studies); Phase 3 (randomized controlled trials (RCTs) to establish the value and support the registration of a new intervention with superiority design (over reference treatment or placebo) or non-inferiority design (against a reference standard treatment)); or Phase 4 (post-registration trials, when the new treatment is being implemented in the field in conditions that are closer to real life). All studies, whether with or without a direct external comparison, should have at least two arms and be randomized, with few exceptions.
Current WHO recommendations provide for multiple options, including no treatment, topical or systemic treatment, depending on the species and clinical judgment. Therefore the choice of the reference treatment will have to be based largely on local experience and expert opinion, yet supported by reliable data. According to the International Conference on Harmonisation (ICH), the choice of a control group should consider its ability to minimize bias, ethical and practical issues associated with its use, usefulness and quality of inference, modifications of study design or combinations with other controls that can resolve ethical, practical, or inferential concerns, and its overall advantages and disadvantages. The guidelines include five types of control groups: i) placebo, ii) no treatment, iii) different dose or regimen of the study treatment, iv) a different active treatment, v) external historical controls (the latter being of very limited use as it carries important biases and raises serious concerns as to between-groups comparability).
Few cases will warrant a placebo or no-treatment arm unless this is as an ‘add-on’ to generally accepted (partially) effective treatment . The choice of giving patients no treatment or a placebo must be on solid scientific and ethical foundations. A no-treatment arm may be justified in case of uncomplicated, self-healing lesions and will provide much needed information on the natural history of disease upon which future studies can be built – although this may be site-specific and non-generalizable. Such an option will however depend on ethical considerations and local regulations.
It is important to be clear as to what is meant by “placebo”; as a placebo should match the active drug, it may be oral or topical – it is difficult to conceive an injectable placebo. Between the two, the only genuine placebo is oral. Basic interventions like cleaning and protecting the lesion against super-infections, as well as topical placebos are known to modify the natural history of the disease, and will likely accelerate the self-healing rate. For clarity, the term “vehicle control” should be preferred over “topical placebo” when it is made of a cream or ointment with only excipients and no active ingredient. This effect on wound healing should be considered in placebo-controlled trials, though the increased cure rate obtained with the intervention over and above “topical placebo” will be difficult to quantify.
Whatever the comparator, superiority randomized controlled trials (RCTs) are intended to provide evidence that the test intervention is superior to the control intervention.
Calculations and examples follow. The basic statistical elements to be considered in designing a trial are summarized in Box 1.
The choice of the values of the type I error rate, α, and power, 1-β (i.e. how stringent the study will be), as well as the expected cure rates with the control and the improvement to be detected for the test intervention, will determine the sample size of the study. Notably, reliable efficacy data for the comparator arm are needed; wrongly estimating the efficacy of the comparator treatment may result in the study being underpowered, hence failing to produce the intended results.
When the number of arms is >2 (i.e. >1 test intervention or dose), this will have to be accounted for in sample size calculation and result in a larger sample size per group, other things being equal, in order to allow for multiple comparisons.
The study may be designed to compare proportions (cure rates) between the control and test intervention, but also means (e.g. of size of lesions). A non-significant result (i.e. no significant difference detected) does not imply that the two treatments are equal .
Examples of assumptions and their implications in terms of sample size calculations are provided in Figure 4 and Table 2, assuming: a two-tailed test, α=0.05; power (1-β)=0.80, 0.85 or 0.90; success rate of the comparator drug=60–90%; and δ=10–30%. The larger the δ, and the more effective the reference intervention, the smaller the sample size. In the typical example of a superiority design with the reference treatment being 80% effective, expecting a 10% difference with the test treatment (90% effective) with power=0.80, 199 patients per arm would need to be recruited. For comparison, a 10% difference with a reference treatment that is 70% effective will require 294 patients.
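The two worked figures above (199 and 294 patients per arm) can be reproduced with the classical normal-approximation formula for comparing two proportions, using a pooled variance under the null hypothesis. This is a minimal sketch: the text does not state which formula was used for Table 2, so the choice of formula is an assumption, and the function name `superiority_n_per_arm` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def superiority_n_per_arm(p_ref, p_test, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided test of two proportions.

    Assumed formula (normal approximation, pooled variance under H0):
    n = [z_(1-alpha/2) * sqrt(2*p_bar*(1-p_bar))
         + z_power * sqrt(p_ref*(1-p_ref) + p_test*(1-p_test))]^2 / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for power 1 - beta
    p_bar = (p_ref + p_test) / 2                   # pooled cure rate
    delta = abs(p_test - p_ref)                    # difference to detect
    root_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    root_alt = z_beta * sqrt(p_ref * (1 - p_ref) + p_test * (1 - p_test))
    return ceil((root_null + root_alt) ** 2 / delta ** 2)


print(superiority_n_per_arm(0.80, 0.90))  # 199
print(superiority_n_per_arm(0.70, 0.80))  # 294
```

Other formulas (for example with a continuity correction) give somewhat different numbers, so the method should be fixed in the protocol before the calculation is made.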
In addition, in calculating the sample size, allowance should be made for losses to follow-up - a parameter which is very much site-specific.
The intent-to-treat (ITT) is generally considered the choice population for analysis; it comprises all patients randomized who gave informed consent and received any amount of the assigned intervention at least once. The practical problem in applying ITT is that it requires measurement on all patients whether or not they are still adhering to the protocol. Thus as soon as one has ‘loss to follow-up’ it is not possible to apply a pure ITT analysis. This population reflects treatment effects in conditions that are closer to those encountered in routine use, as opposed to the per-protocol (PP) population, which is restricted to the patients without major protocol deviations who are evaluable at the planned visit for efficacy assessment and thus measures the pure treatment effect (“evaluable patients' analysis”). The mITT population definition is used to overcome the bias of the ITT population. It is a subset of the ITT population allowing for the exclusion of patients due to non compliance or missing outcome. Conclusions will be drawn from the results on the primary criteria calculated on the ITT or the modified-ITT (mITT) population.
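The difference between the two populations can be illustrated with a small calculation. The sketch below assumes that, in the ITT analysis, patients with a missing outcome are counted conservatively as failures, and that the PP analysis drops patients with a missing outcome or a major protocol deviation. The `Patient` record, the function names and the patient counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    cured: Optional[bool]          # None = outcome missing (lost to follow-up)
    major_deviation: bool = False  # True = major protocol deviation


def itt_cure_rate(patients: List[Patient]) -> float:
    """All randomized patients; a missing outcome counts as failure (assumption)."""
    return sum(1 for p in patients if p.cured is True) / len(patients)


def pp_cure_rate(patients: List[Patient]) -> float:
    """Only evaluable patients without major protocol deviations."""
    evaluable = [p for p in patients
                 if p.cured is not None and not p.major_deviation]
    return sum(1 for p in evaluable if p.cured) / len(evaluable)


# Hypothetical arm of 100 patients.
arm = ([Patient(cured=True)] * 80                            # cured, per protocol
       + [Patient(cured=False)] * 10                         # failed, per protocol
       + [Patient(cured=None)] * 6                           # lost to follow-up
       + [Patient(cured=True, major_deviation=True)] * 4)    # cured, major deviation

print(round(itt_cure_rate(arm), 3))  # 0.84
print(round(pp_cure_rate(arm), 3))   # 0.889
```

With the same raw data the two estimates differ by about five percentage points, which is of the same order as the margins discussed for non-inferiority trials.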
Non-inferiority trials are intended to show that the new intervention is no worse than the standard drug by some margin Δ (the non-inferiority margin), defined as the largest clinically acceptable difference; it should be smaller than the differences observed in superiority trials of the active comparator.
The non-inferiority design has become increasingly popular in malaria and tuberculosis (where very effective treatments exist), but is rarely used in leishmaniasis; so far, it has been used for visceral leishmaniasis (VL) randomized controlled trials in India and East Africa (DNDi, ClinicalTrials.gov NCT01067443).
The choice of the non-inferiority margin is very important as it governs the validity of the trial, and also has ethical implications. The objective is to avoid a harmful treatment being declared non-inferior, and to retain a treatment that brings a true benefit for the patient. The decision should be based on previous studies with the reference treatment and the minimally important effect that one wants to observe with the new treatment which would provide additional benefit for the patients.
In order to identify the correct Δ, it has also been proposed to compare (i) the two-sided 95% confidence interval of the difference between the test and the reference treatment to (ii) a two-sided 95% CI of the difference between the reference treatment and the placebo based on historical data and meta analyses (if such data are available) . Virtual comparison methods are also available, whereby the new treatment is compared to a putative placebo by synthesizing the estimated effectiveness of the former versus an active control and the estimated effect of the latter versus the placebo .
It is important to note that defining the Δ is not a mere statistical exercise; it requires consideration of what is a clinically acceptable failure rate, in the context of other factors, such as practicalities (duration of treatment, route of administration) and costs.
Calculations and examples follow. The basic elements to be considered in designing a non-inferiority trial are similar to those of a superiority design. The difference is in the choice of the margin and the test used to compare the treatment estimates. When success or failure rates are used to measure treatment effects, it is common to compare the 95%CI lower limit to the non-inferiority margin. However, in the case of proportions, it may also be of interest to compare risk ratios (RR) or odds ratios (OR) with a non-inferiority margin specified on the RR or OR scale.
In the examples that follow we work with proportions and 95%CI. The sample size is calculated based on the expected proportion of events in the reference arm (80%, 85%, 90% or 95%), the expected true difference in proportions between the reference and the tested treatment arms (0%), α risk=0.01, unilateral hypothesis, power (1-β)=90%, and the equivalence margin defined as acceptable for concluding that a tested treatment is not inferior to the reference arm (from 5% to 10%, meaning that one is prepared to accept that the test treatment is 5% or 10% less effective than the reference treatment).
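As an illustration of comparing the lower confidence limit with the margin, the sketch below computes a Wald (normal-approximation) two-sided 95% confidence interval for the difference in cure rates (test minus reference) and declares non-inferiority when the lower limit lies above -Δ. The Wald interval is an assumption made for simplicity (other interval methods exist), and the function name and the patient counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_check(cured_test, n_test, cured_ref, n_ref,
                         margin, conf_level=0.95):
    """Return (difference, lower limit, upper limit, non_inferior flag).

    difference = cure rate (test) - cure rate (reference); non-inferiority
    is concluded when the lower confidence limit is above -margin.
    """
    p_test = cured_test / n_test
    p_ref = cured_ref / n_ref
    diff = p_test - p_ref
    # Standard error of the difference of two independent proportions (Wald).
    se = sqrt(p_test * (1 - p_test) / n_test + p_ref * (1 - p_ref) / n_ref)
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    lower, upper = diff - z * se, diff + z * se
    return diff, lower, upper, lower > -margin


# Hypothetical trial: 170/200 cured on test, 172/200 cured on reference.
diff, lower, upper, ok = noninferiority_check(170, 200, 172, 200, margin=0.10)
print(f"difference={diff:.3f}, 95% CI=({lower:.3f}, {upper:.3f}), non-inferior={ok}")

# The same data do not support non-inferiority with a stricter 5% margin.
print(noninferiority_check(170, 200, 172, 200, margin=0.05)[3])
```

The example shows how the conclusion depends on the margin: an identical observed difference can be declared non-inferior at Δ=10% but not at Δ=5%.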
The larger the Δ, and the more effective the reference intervention, the smaller the sample size. Using a reference treatment that is 80% effective, the sample size varies from 1667 (5% Δ) to 417 (10% Δ); similarly, for the same Δ=10%, the sample will be 124 when the reference treatment is 95% effective.
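The sample sizes quoted above (1667, 417 and 124 per arm) are consistent with the usual normal-approximation formula for a one-sided non-inferiority test of two proportions with a true difference of zero. The sketch below assumes that formula (the text does not name the method used for Table 3); the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def noninferiority_n_per_arm(p_ref, margin, alpha=0.01, power=0.90):
    """Patients per arm, assuming equal true cure rates in both arms.

    Assumed formula: n = (z_(1-alpha) + z_power)^2 * 2*p*(1-p) / margin^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile for power 1 - beta
    variance = 2 * p_ref * (1 - p_ref)         # sum of the two binomial variances
    return ceil((z_alpha + z_beta) ** 2 * variance / margin ** 2)


print(noninferiority_n_per_arm(0.80, 0.05))  # 1667
print(noninferiority_n_per_arm(0.80, 0.10))  # 417
print(noninferiority_n_per_arm(0.95, 0.10))  # 124
```

Because the margin enters as a square in the denominator, halving Δ roughly quadruples the number of patients required.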
The total sample size is obtained by allowing for the expected proportion of drop-outs (5%, for instance) and multiplying by 2 (groups). When more than 2 groups are studied, the calculation will have to include an adjustment for multiplicity, such as the Bonferroni correction. More results are presented in Figure 5 and Table 3.
These calculations show the importance of the non-inferiority margin and the proportions for the reference treatment. When the α risk and power are fixed, the sample size can grow steeply with even a small change in the assumptions.
Between the ITT and the PP populations, ITT may bias the results toward equivalence, which could make a truly inferior treatment appear non-inferior. ITT analysis carries the risk of falsely claiming non-inferiority, although this may not always be the case (reviewed and discussed in Piaggio et al).
According to Abraha et al, in non-inferiority trials “excluding participants who did not adhere fully to the protocol can be justified. Exclusions may, however, affect the balance between the randomized groups and lead to bias if rates and reasons for exclusion differ between groups”. The current thinking of regulatory agencies is that the study objective should be achieved in both the ITT and PP populations, especially in a non-inferiority trial. However, Matilde-Sanchez et al argue that this “does not necessarily guarantee the validity of a non-inferiority conclusion and a sufficiently powered PP analysis is not necessarily powered for ITT analysis”. These authors propose to perform a new maximum likelihood-based ITT analysis, arguing that it could address “the potential types and rates of protocol deviation and missingness that might occur in a non-inferiority trial” and that “prior knowledge regarding the treatment trajectory of the test treatment versus the active control at the design stage” should be collected “so that a proper analysis plan and appropriate power estimation can be carried out”.
Illustrating the divergent conclusions toward non-inferiority between the ITT and PP populations is outside the scope of this work. Nevertheless, the examples provided in Table 4 (which use rates derived from published NWCL studies at 6–12 months of follow-up) illustrate how much exclusions can influence the sample size required to prove non-inferiority: the more patients are excluded and the less effective the reference treatment is, the larger the sample size required for a given non-inferiority margin; conversely, the sample size decreases when the non-inferiority margin increases. This means that different conclusions as to non-inferiority may be reached on the ITT vs. the PP populations. Therefore, special attention must be paid to minimizing losses to follow-up and numbers of patients deemed non-assessable, both of which would be deducted from the PP population.
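A minimal sketch of the two adjustments mentioned above follows. It assumes the common convention of inflating the per-arm sample size by 1/(1 - drop-out proportion), which the text does not specify, and the plain Bonferroni rule of dividing α by the number of comparisons. Function names are hypothetical.

```python
from math import ceil


def total_sample_size(n_per_arm, n_arms=2, dropout=0.05):
    """Total number to enrol so that about n_per_arm remain evaluable per arm.

    Assumed convention: enrolled per arm = n_per_arm / (1 - dropout).
    """
    enrolled_per_arm = ceil(n_per_arm / (1 - dropout))
    return enrolled_per_arm * n_arms


def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted significance level for each comparison."""
    return alpha / n_comparisons


# 417 evaluable patients per arm, 5% expected drop-outs, two arms.
print(total_sample_size(417))      # 878
# Two test arms each compared with the reference at overall alpha = 0.01.
print(bonferroni_alpha(0.01, 2))   # 0.005
```

The adjusted α would then be fed back into the per-arm sample size calculation, which increases the number of patients needed in each group.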
A precision estimate can be used when one can estimate success/failure rates or means as well as mean difference from previous studies done in a different environment or time period. The objective is therefore to evaluate this estimate and its variability in a new population.
Examples of sample size with precision estimate  if the required success rate is
The precision estimate is used in the case of non-comparative design, therefore it cannot judge the efficacy of a treatment comparatively to placebo or an active treatment. It could be used however for dose-finding.
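For a single-arm (non-comparative) estimate of a cure rate, the sample size is commonly derived from the desired half-width of the confidence interval. The sketch below uses the standard normal-approximation formula n = z² p(1-p)/d²; the formula, the function name and the numerical values (90% expected cure rate, ±5% precision) are illustrative assumptions and are not taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def precision_sample_size(expected_rate, half_width, conf_level=0.95):
    """Patients needed so that the confidence interval around the expected
    rate has approximately the requested half-width (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil(z ** 2 * expected_rate * (1 - expected_rate) / half_width ** 2)


# Illustrative values only: expected cure rate 90%, precision of +/- 5%.
print(precision_sample_size(0.90, 0.05))  # 139
```

A tighter precision or an expected rate closer to 50% both increase the number of patients required.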
These designs are meant to allow choices amongst various drugs and regimens (dose, duration) systematically, as quickly and effectively and with as few patients as possible. The term includes group sequential designs, sequential methods and methods to stop trials with superiority or non-inferiority designs early.
Adaptive trial designs are increasingly used to improve efficiencies in the R&D process. This approach allows redesigning the trial based on the information acquired through interim analyses, which may result in changing the sample size, the number of arms, or other elements. Sequential and group sequential trials are a special case of adaptive trials where several interim analyses are done in order to complete the trial earlier based on the accumulated information. We will concentrate here on sequential methods, and more specifically on the Whitehead triangular test, a graphical method defining boundaries which allow for early rejection or non-rejection of H0.
Examples using the Whitehead triangular test  follow. In this example, the hypothesis to be tested will be a difference of 8% between the failure rate (in %) of each group and the boundaries calculated for 10 discrete stages of evaluation.
The type I and type II risk are commonly set at α=0.05, power (1-ß)=0.80 i.e. the risk to reject an effective treatment is 5% and the chance for the study to find an effective treatment is 80%. The null proportion is set at 0.1 and the alternate proportion is set at 0.18, 0.20 and 0.25. These assumptions mean that if the failure rate <10%, efficacy is considered adequate, and if the failure rate ≥25% efficacy is insufficient. In terms of probabilities, it can be written that the boundaries of the test are calculated for H0(p≥p0) and Ha(p<pa) with p0=0.25 and pa=0.10.
Different sample sizes are estimated when varying pa, the alternate proportion of failure: pa=0.18: min=8, max=80; pa=0.20: min=7, max=67; pa=0.25: min=5, max=46 (see Figure 6). Table 5 provides an example of calculations for a two-sided test.
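To illustrate what the boundaries of a triangular test look like, the following is a minimal sketch and not the calculation used for Figure 6 or Table 5. It assumes the simplest textbook form of the one-sided triangular test with equal error risks (α=β), straight-line boundaries in the plane of the efficient score Z against the information V, and the usual correction for discrete looks. The reference effect theta_r is taken, as an assumption, to be the log odds ratio between failure rates of 0.25 and 0.10; the function name and the spacing of the looks are hypothetical.

```python
import math

def triangular_boundaries(theta_r, alpha, v_looks):
    """Boundaries of a one-sided triangular test, assuming alpha == beta.

    theta_r : reference treatment effect (here a log odds ratio, an assumption)
    alpha   : one-sided type I risk, assumed equal to the type II risk
    v_looks : increasing values of the information V at each interim look
    Returns the intercept a, the slope c and, for each look, (V, lower, upper).
    """
    a = 2.0 * math.log(1.0 / (2.0 * alpha)) / theta_r
    c = theta_r / 4.0
    rows = []
    v_prev = 0.0
    for v in v_looks:
        # Correction for discrete monitoring: both boundaries are brought
        # inwards by 0.583 * sqrt(increment in V since the previous look).
        shift = 0.583 * math.sqrt(v - v_prev)
        upper = a + c * v - shift          # crossing above: reject H0
        lower = -a + 3.0 * c * v + shift   # crossing below: do not reject H0
        rows.append((v, lower, upper))
        v_prev = v
    return a, c, rows

# Hypothetical illustration: failure rates 0.25 vs 0.10, ten equally spaced looks.
theta = math.log((0.25 / 0.75) / (0.10 / 0.90))
a, c, rows = triangular_boundaries(theta, 0.05, [1.5 * i for i in range(1, 11)])
for v, lower, upper in rows:
    print(f"V={v:5.1f}  lower={lower:7.3f}  upper={upper:7.3f}")
```

Without the discrete correction the two lines meet at V=a/c, which is what gives the continuation region its triangular shape and guarantees a maximum sample size. With unequal risks, such as α=0.05 and power 0.80, the reference effect needs a further adjustment that this sketch does not attempt.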
The triangular test is not without shortcomings, especially in the context of diseases like CL: (i) it is most effective when early end-points exist, which is not the case for CL, though one could consider use of a surrogate marker e.g. 50% re-epithelisation at 42–63 days or another clinically-relevant parameter; (ii) it also requires an efficient (on-line) data-management system in place and a constant interface with a statistician.
The advantages of sequential methods such as the triangular test are that they allow the analysis of the accumulated information at each step, early stopping (when treatment proves effective (p0) or ineffective (pa)), and both non-comparative and comparative designs, and that they can eventually result in shortening study duration and reducing the number of subjects to be exposed. As with the fixed sample designs, several treatments (or doses) can be tested in parallel, which is particularly useful for dose-finding (Phase 2) studies.
It would also be possible to conceive a design that combines sequentially, in a single study, (1) screening of potential treatments (one-sided triangular test applied to multiple non-comparative studies as required) and (2) comparison of the treatment so selected with the reference treatment (two-sided triangular test).
In a trial testing a new drug, one has to make assumptions on the number of subjects who will not complete the study for any reason. These subjects may not be properly accounted for in the typical ITT or PP population analyses because they would not have reached an endpoint that qualifies them for the analysis: in the first case they will be counted conservatively as failures (though they are not demonstrated failures) and in the latter they will be discounted. When dealing with cure rates, one way to circumvent this problem is to use survival (time-to-event) analysis, whereby the information accumulated by a subject while on study is accounted for up until the time that s/he drops out of the study or reaches a study endpoint. Withdrawals such as drop-outs, failures or deviations will be censored at the time such an event occurs and accounted for, for as long as the subject has been on study.
While this approach is rarely used in CL, it would also have the additional advantage of accounting for time-to-healing, which is an important consideration when comparing treatments, or comparing treatment to a placebo (because of the variable tendency to natural healing and the effects of (topical) placebos on the natural rates of recovery).
Specifically, interventions would be assessed based on the survival estimate of healing at a specific day (e.g. end of follow-up) evaluated for instance using the Kaplan-Meier method (other methods exist) as shown in Figure 7. It is advisable to include denominators at each time-point of the plot to show the decreasing numbers of patients contributing to the analysis as time goes on.
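As an illustration of the product-limit calculation and of the denominators mentioned above, the following is a minimal sketch in plain Python. The data are hypothetical and the function name is ours; the "event" is taken to be healing, so the estimate S(t) is the probability of not yet being healed at time t, and the cumulative probability of healing is 1-S(t).

```python
from collections import Counter

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate.

    times  : follow-up time of each subject (e.g. days on study)
    events : 1 if the subject reached the endpoint (healed) at that time,
             0 if the subject was censored (drop-out, end of follow-up)
    Returns one row per distinct event time:
    (time, number at risk, number of events, S(t)).
    """
    events_at = Counter(t for t, e in zip(times, events) if e)
    leaving_at = Counter(times)      # events and censorings both leave the risk set
    at_risk = len(times)
    surv = 1.0
    rows = []
    for t in sorted(leaving_at):
        d = events_at.get(t, 0)
        if d:
            surv *= 1 - d / at_risk
            rows.append((t, at_risk, d, surv))
        at_risk -= leaving_at[t]
    return rows

# Hypothetical data: six subjects, two of them censored (days 21 and 35).
days = [14, 21, 21, 28, 35, 42]
healed = [1, 1, 0, 1, 0, 1]
for t, n, d, s in kaplan_meier(days, healed):
    print(f"day {t}: at risk={n}, healed={d}, not yet healed={s:.3f}, healed so far={1 - s:.3f}")
```

The "number at risk" column is the denominator that should accompany each time-point of the plot.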
Outcomes between arms are normally compared using the Log-Rank test, or the proportional hazard model, which allows adjustment for independent factors and also estimates the relative risk (hazard ratio) of one arm over the other.
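The Log-Rank comparison of two arms can be sketched as follows. This is a minimal, unstratified version written in plain Python for illustration, with hypothetical data and a function name of our choosing; in practice a validated statistical package should be used. At each distinct event time the observed number of events in arm A is compared with the number expected if both arms shared the same hazard, and the resulting statistic is referred to a chi-square distribution with one degree of freedom.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-arm log-rank test. Returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0   # sum over event times of (observed - expected) in arm A
    variance = 0.0        # hypergeometric variance of that sum
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, arm A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)   # at risk, arm B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical data: days to healing in two arms, no censoring.
chi2, p = logrank_test([10, 12, 14, 16, 18], [1] * 5, [30, 32, 34, 36, 38], [1] * 5)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```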
Survival analyses can be applied both to superiority and non-inferiority trials, but sample size calculation should be adapted in the latter case (Vaillant & Olliaro, manuscript in preparation).
Herewith we provide an example of a sample size calculation for a non-inferiority trial based on the assumption of a 3-month study duration and a cumulative drop-out rate of 10%. With a type I error α=1% and a power 1-β=90%, assuming a cure rate of 80% with the reference treatment and non-inferiority margins of 10%, 7% or 5%, the total sample size required to demonstrate non-inferiority would be 1030, 2020 or 3842 patients respectively. Additional calculations with cure rates of 85% and 90% are also presented in Table 6.
When comparing calculations made with and without allowance for product-limit estimate analysis, the calculations that do not allow for it appear to underestimate the sample size systematically, by a factor that is proportionally higher as δ and the reference treatment efficacy increase (from 13% with δ=5% and 80% efficacy to 33% with δ=10% and 90% efficacy; Figure 8) and as the total sample size decreases (from 3842 to 706 patients with, and from 3334 to 470 without, allowance for product-limit analysis).
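For orientation, the conventional fixed-sample calculation for a non-inferiority comparison of two proportions can be sketched as below. This is not the product-limit method referred to above, whose details are not given here; it is the standard normal-approximation formula, and the assumptions that α is one-sided, that allocation is 1:1, that both arms share the same true cure rate and that no inflation for drop-outs is applied are ours. Under those assumptions it returns the totals of 3334 and 470 quoted above for the calculations that do not allow for product-limit analysis.

```python
import math
from statistics import NormalDist

def noninferiority_total_n(cure_rate, margin, alpha=0.01, power=0.90):
    """Total sample size (two equal arms) for a non-inferiority trial.

    Normal-approximation formula, assuming the same true cure rate in
    both arms, a one-sided type I error alpha and no drop-outs:
        n per arm = (z_{1-alpha} + z_{power})^2 * 2p(1-p) / margin^2
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha) + z(power)
    variance = 2 * cure_rate * (1 - cure_rate)
    n_per_arm = math.ceil(z_sum ** 2 * variance / margin ** 2)
    return 2 * n_per_arm

print(noninferiority_total_n(0.80, 0.05))   # cure rate 80%, margin 5%
print(noninferiority_total_n(0.90, 0.10))   # cure rate 90%, margin 10%
```

The required sample size grows with the inverse square of the margin, which is why halving δ from 10% to 5% roughly quadruples the number of patients.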
This section provides general directions as to the choice of the appropriate trial design for CL. Against the backdrop of the general lack of standardization and inadequate design in CL clinical trials, as well as the considerations listed above, different designs will befit different questions:
All trials should be registered (see the WHO International Clinical Trials Registry Platform, WHO-ICTRP) and reported, whether the results are favourable, unfavourable or inconclusive, both for ethical and scientific reasons. Traditionally, the importance of negative results has been underestimated both by researchers and publishers; publishing only positive results will bias knowledge. The CONSORT checklist (study design, analysis and interpretation) and flow diagram (patient attrition throughout the study) should be followed. Major journals today do not publish papers on trials that have not been registered or that do not follow the CONSORT guidelines (see example in Figure 9).
The protocol must be clear as to the population for analysis – typically: intent-to-treat (ITT), modified ITT (mITT) and per-protocol (PP). The basis for exclusion of patients from the analysis must be provided. Patients withdrawn because they could not tolerate treatment or because they required rescue treatment must be accounted for. The analytical plan should be finalised before freezing the data for analysis.
As in any other trial, an appropriate data management process is critical in order to have high-quality data, statistical analyses and results. For this purpose, the data management software adopted must provide a secure location for the clinical data, user rights and profiles along with password protection, as well as an audit trail. Capacity for data management is often scarce in CL-endemic countries, in terms of both the availability of appropriate software with an audit trail and trained data managers. In these countries there is also a general shortage of statisticians to help design, analyse and report on trials. Capacity building efforts should be organized to increase the competences of research teams in this important area.
Clinical trials must be conducted in accordance with current international standards of Good Clinical Practice (GCP), an international ethical and scientific quality standard for designing, conducting, recording and reporting trials that involve the participation of human subjects. Compliance with this standard provides public assurance that the rights, safety and well-being of trial subjects are protected, consistent with the principles that have their origin in the Declaration of Helsinki, and that the clinical trial data are credible. When GCP standards are followed, the quality of data from clinical trials is adequate to make informed clinical and policy decisions.
There is a belief among some that GCP guidelines are only for “registration” studies and not for all clinical trials. However, the principles of GCP should be applied to all clinical studies with any intervention conducted at any stage of development that may have an impact on the safety and well-being of human subjects. Implementation of GCP procedures requires initial training and practice and is best served when trial personnel at a site accept and understand a culture of GCP. Maintaining a GCP environment requires constant training and reinforcement and is a process that requires continuous growth in a site and its personnel. Accepted GCP standards include those published by the International Conference on Harmonization (ICH) and the World Health Organization (WHO). The ICH GCP guideline is published under Efficacy (E6) and is often referred to as the ICH E6 GCP guideline. A summary review of the principles of GCP is found in the WHO handbook.
At the same time, it should be clear that GCP is not about dogma, but rather about patient care and reliability of data, and that the context within which trials occur should be accounted for. A proper balance between the goals of the clinical study and the documentation required has been proposed. The amount of written documentation and the degree of detail required by GCP procedures can be a shock to investigators not used to working in this environment. Although the conduct of clinical trials under GCP with external monitors and proper data management will inevitably increase the cost of studies, it is imperative that higher quality studies in CL be conducted.
For all trials involving human subjects, ethics review and approval must be sought from appropriate boards/committees at the institution (local and/or international) and/or country level as required. It is imperative that all clinical studies are conducted in accordance with international and country regulations and laws.
The opinions expressed in this paper are those of the authors; the authors alone are responsible for the views expressed in this publication and they do not necessarily represent the decisions, policy or views of the WHO.
Material has been reviewed by the Walter Reed Army Institute of Research. There is no objection to its presentation and/or publication. The opinions or assertions contained herein are the private views of the author, and are not to be construed as official, or as reflecting true views of the Department of the Army or the Department of Defense.
The authors would like to thank all the participants in the workshop organised by WHO/NTD and WHO/TDR on 15th–17th December 2009, which provided the basis for the current paper. Specifically: Mohammad Hossein Alimohammadian, Fabiana Alves, Abraham Aseffa, Afif Ben Salah, Urbà González, Lama Jalouk, Alejandro Llanos-Cuentas, Iván Dario-Velez. We are also indebted to Steven Senn for reviewing the study design and statistical section and to Pascal Launois and Christine Halleux for critically reviewing the manuscript.