A random sample of 100 papers from the general orthopaedic literature was obtained and included 27 randomized controlled trials (RCTs), 30 case–control (CC) studies, 16 longitudinal (L) studies and 27 cross-sectional (CS) studies. The sample was stratified by study type to ensure accurate representation of each of the four types of study and additional inclusion criteria were as follows:

(i) Published research papers from seven general orthopaedic journals [5] covering a range of impact factors [6]: *Journal of Bone and Joint Surgery* (American), *Clinical Orthopaedics and Related Research*, *Journal of Bone and Joint Surgery* (British), *Acta Orthopaedica*, *Archives of Orthopaedic and Trauma Surgery*, *International Orthopaedics* and *BMC Musculoskeletal Disorders*

(ii) Original research only – excluding trial protocols, reviews, meta-analyses, short reports, communications and letters

(iii) Published between 1st January 2005 and 1st March 2010 (study start date)

(iv) No more than one paper from any single research group

(v) Papers published by research groups based at our own institutes were excluded to avoid assessment bias

Full details of the search strategy and methods used to collect the sample are provided by Parsons et al. [2].

The statistical quality of each paper was assessed using a validated questionnaire [7], which was adapted to reflect the specific application to orthopaedic research [2]. After randomly numbering the papers from 1 to 100, each paper was read and independently assessed using the questionnaire by two experienced statisticians (NP and CP). Even-numbered papers were read by NP and odd-numbered papers were read by CP. The questionnaire was divided into two parts. Part one captured data describing the type of study, the population under study, the design, the outcome measures and the methods of statistical analysis; the results of this part were reported in Parsons et al. [2]. A random sample of 16 papers from the original 100, stratified by study type to ensure balance, was selected and read by both statisticians to assess the level of agreement between the two reviewers for individual items on part one of the questionnaire. Parsons et al. [2] reported kappa statistics in the range 0.76 to 1.00, with a mean of 0.96, suggesting good agreement between the reviewers for this more objective part of the survey. The second part of the questionnaire required generally more subjective assessments concerning the presentation of data and the quality and appropriateness of the statistical methods used (see Additional file 1 for questionnaire details). The results of this part are reported in detail here. The survey allowed a detailed investigation of issues such as the description of the sample size calculation, missing data, the use of blinding in trials, the experimental unit, multiple testing and the presentation of results.
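Inter-rater agreement of the kind reported above is conventionally quantified with Cohen's kappa, which adjusts the raw proportion of agreement for the agreement expected by chance. The following is a minimal pure-Python sketch of the calculation; the ratings shown are invented for illustration and are not the study's data.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement, from each rater's marginal category frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2.get(k, 0) for k in c1) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no answers from two reviewers on one questionnaire item
r1 = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"]
r2 = ["yes", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(round(cohen_kappa(r1, r2), 3))  # → 0.714
```

A kappa near 1 (as in the range 0.76 to 1.00 reported) indicates agreement well beyond chance; the example above, with one disagreement in eight, gives a more moderate value.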

The *correctness*, *robustness*, *efficiency* and *relevance* [7] of the statistical methods reported in the sample papers were assessed using a yes or no assignment for each characteristic.

*Correctness* refers to whether the statistical method was appropriate. For instance, it is not correct to use an unpaired *t*-test to compare an outcome from baseline to the trial endpoint for a single group of patients. Many statistical methods rely on a number of assumptions (e.g. normality, independence); if those assumptions do not hold, the selected method can produce misleading results. In this context we would describe the selected methods as lacking *robustness*. A statistical method was rated as *inefficient* if, for example, a nonparametric rather than a parametric method was used for an analysis where the data conformed to a known distribution (e.g. using a Mann–Whitney test rather than a *t*-test). Finally, an analysis was regarded as *relevant* if it answered the question posed in the study. For instance, a principal components analysis may be correct and efficient for summarising a multivariate dataset, but may have no bearing on the stated aim of a paper.
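The *correctness* example can be made concrete: for baseline-to-endpoint comparisons in a single group, the paired *t*-test analyses within-patient differences, whereas an unpaired test wrongly treats the two measurement occasions as independent groups and typically discards power. The sketch below, using invented data for five hypothetical patients, implements both test statistics directly.

```python
import math
import statistics

def paired_t(before, after):
    """Paired t-statistic: tests the mean within-patient change against zero."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def unpaired_t(x, y):
    """Unpaired pooled-variance t-statistic: ignores the pairing of patients."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) +
           (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(y) - statistics.mean(x)) / math.sqrt(sp2 * (1/nx + 1/ny))

# Hypothetical baseline and endpoint scores for the same five patients
baseline = [10.0, 12.0, 9.0, 11.0, 13.0]
endpoint = [12.5, 14.0, 11.0, 13.5, 15.0]
print(round(paired_t(baseline, endpoint), 2))    # → 17.96
print(round(unpaired_t(baseline, endpoint), 2))  # → 2.24
```

Because each patient improves by a similar amount, the paired statistic is far larger than the unpaired one here; with noisier data the unpaired analysis could miss a real within-patient effect entirely, which is why the paired design must be respected in the analysis.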

The majority of the survey items were objective assessments of quality (e.g. whether an incorrect method of analysis was used), with a small number of more subjective items (e.g. whether a different analysis could have changed the conclusions).