Immunological parameters are hard to measure. A well-known problem [1] is the occurrence of values below the detection limit, the non-detects. In the project that serves as the example in this paper, the data are concentrations of soluble biological markers in human blood, and for some parameters more than half of the observations are non-detects.
Non-detects (NDs) are a nuisance in statistical analysis. An ad-hoc solution is to fill in values for the NDs, e.g. one half of the detection limit. This may be acceptable if only a few per cent of the observations are NDs. If there are many of them, estimated values of means, standard errors and trend lines will be unreliable and conclusions may be wrong.
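The effect of substitution on a mean can be illustrated with a small sketch. The concentrations, the detection limit of 1.0 and the helper name `mean_with_substitution` are hypothetical and chosen only for illustration; they do not come from the example data set. The code is written in Python for convenience.

```python
from statistics import mean

# Hypothetical concentrations; None marks a non-detect, i.e. a value that is
# only known to lie below the detection limit.
DETECTION_LIMIT = 1.0
observations = [None, None, None, None, None, None, 1.2, 1.9, 2.5, 4.4]


def mean_with_substitution(values, fill):
    """Mean after replacing every non-detect by the constant `fill`."""
    return mean([fill if v is None else v for v in values])


# Try three common substitution choices: 0, half the limit, the limit itself.
for fraction in (0.0, 0.5, 1.0):
    fill = fraction * DETECTION_LIMIT
    result = mean_with_substitution(observations, fill)
    print(f"fill = {fill:.2f} -> mean = {result:.2f}")
```

With 60% NDs, as in this made-up sample, the computed mean is driven largely by the arbitrary fill value rather than by the measurements.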
NDs occur in many places in science and technology. They have received a lot of attention in the work of Helsel [2]. Although NDs are extremely common in immunology, the literature about them is not very extensive. An exception is the paper by Uh et al. [1], which studies a number of approaches to analysing datasets with NDs. That paper did not consider quantile regression. We believe it to be a very useful tool and would like to share our experiences in this expository paper.
Most statistical methods develop a model for the expected values of the observations. In an analysis of variance (ANOVA) these will be the mean values for different groups. In the case of the regression line y = ax + b, the parameters a and b allow us to compute the expected value of an observation y for every value of x that we are interested in; x might be age, time or another covariate. In addition, we can compute prediction intervals, in which a new observation will lie with a specified probability. This type of model belongs to the standard toolbox that most applied scientists learn these days in their statistics lessons. Modern statistical packages make it very easy to use such models in practice.
Regression and ANOVA (which is a special case of regression) use the so-called principle of least squares: parameters such as a and b in the example above are computed in such a way that the sum of the squares of the residuals is minimized. The residuals are the differences between the observations and the expected values according to the model. If a substantial part of the observations is wrong, because many NDs were replaced by arbitrary values, the parameter estimates can be (very) wrong as well.
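A minimal sketch of this sensitivity, assuming hypothetical data with a detection limit of 1.0 (the numbers and the helper names `least_squares_line` and `fill_non_detects` are illustrative, not taken from the example data set). It fits y = ax + b by ordinary least squares twice, once for each of two substituted values, using the closed-form solution.

```python
# Hypothetical data: x is a covariate such as age, y a concentration.
# None marks a non-detect (only known to lie below the detection limit).
DETECTION_LIMIT = 1.0
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [None, 2.0, None, 3.0, None, 2.5, None, 4.0]


def fill_non_detects(values, fill):
    """Replace every non-detect by the constant `fill`."""
    return [fill if v is None else v for v in values]


def least_squares_line(x_values, y_values):
    """Closed-form least-squares estimates (a, b) for the line y = a*x + b."""
    n = len(x_values)
    x_mean = sum(x_values) / n
    y_mean = sum(y_values) / n
    sxx = sum((x - x_mean) ** 2 for x in x_values)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


for fill in (0.5 * DETECTION_LIMIT, 0.01 * DETECTION_LIMIT):
    a, b = least_squares_line(xs, fill_non_detects(ys, fill))
    print(f"fill = {fill:.2f}: y = {a:.3f} * x + {b:.3f}")
```

Both the slope and the intercept change with the substituted value, although the measured observations are identical in the two fits.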
In this paper we propose to use quantile regression instead of the usual linear regression models. A simple example is provided by ANOVA. Instead of computing means per group, one could compute the medians, also known as P50, the 50th percentile. A familiar recipe for computing the median of a set of numbers is to sort them from low to high and pick the middle number in the sorted list. Half of the data will be below this number and the other half will be above it. The key point is that the actual values of the lowest observations play no role: what matters is that they are lower than the median. So if we had 30% NDs and assigned them arbitrary small values, the computed median would still be the same.
If more than 50% of the observations are NDs, but less than 75%, we are still able to compute the P75, the number below which 75% of the data are found. In ANOVA we can still compare P75 in the different groups and look for interesting differences.
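The sorting recipe for percentiles can be sketched as follows. The data and the detection limit are hypothetical, and the nearest-rank rule used in `percentile` is one assumed convention among several (for an even number of observations it returns the lower of the two middle values as the median).

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: sort, then take the value at rank ceil(p% of n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def fill_non_detects(values, fill):
    """Replace every non-detect (None) by the constant `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical samples with detection limit 1.0.
few_nds = [None, None, None, 1.2, 1.9, 2.2, 2.5, 3.1, 4.4, 5.0]   # 30% NDs
many_nds = [None] * 6 + [1.2, 1.9, 2.5, 4.4]                      # 60% NDs

for fill in (0.5, 0.01):
    print(
        f"fill = {fill:.2f}:",
        "30% NDs P50 =", percentile(fill_non_detects(few_nds, fill), 50),
        "| 60% NDs P50 =", percentile(fill_non_detects(many_nds, fill), 50),
        "| 60% NDs P75 =", percentile(fill_non_detects(many_nds, fill), 75),
    )
```

In this sketch the median of the sample with 30% NDs and the P75 of the sample with 60% NDs do not depend on the fill value, whereas the "median" of the sample with 60% NDs is simply the fill value itself.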
For a regression line, the sorting recipe will not work. However, in the last two decades a very useful generalization of regression modelling has become available: quantile regression. With this method we can estimate regression lines that give, for any value of x, a percentile of our choice for y. The only condition is that all NDs lie below the line. With many NDs, as in our example data set, this means that it is not possible to compute a line for the median, but a line for the P75 can still be estimated.
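The idea can be made concrete with a small sketch in Python. It is not the algorithm used by quantreg: it is a brute-force stand-in that only suits tiny data sets, and it relies on the known property that some minimiser of the summed asymmetric absolute loss (the "check loss") passes through two data points. The data, the detection limit of 1.0 and the function names are hypothetical.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Asymmetric absolute loss: weight tau above the line, 1 - tau below it."""
    return residual * tau if residual >= 0 else residual * (tau - 1)


def fit_quantile_line(xs, ys, tau):
    """Brute-force quantile regression for the line y = a*x + b.

    Tries the line through every pair of data points and keeps the one with
    the smallest total check loss. Feasible only for small samples.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a candidate
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        loss = sum(check_loss(y - (a * x + b), tau) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


def fill_non_detects(values, fill):
    """Replace every non-detect (None) by the constant `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical data with 50% non-detects.
DETECTION_LIMIT = 1.0
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [None, 2.0, None, 3.0, None, 2.5, None, 4.0]

for fill in (0.5 * DETECTION_LIMIT, 0.01 * DETECTION_LIMIT):
    a, b = fit_quantile_line(xs, fill_non_detects(ys, fill), tau=0.75)
    print(f"fill = {fill:.2f}: P75 line y = {a:.3f} * x + {b:.3f}")
```

For these made-up data the fitted P75 line lies above the detection limit over the whole range of x, so all NDs fall below it and the two fill values lead to the same line.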
The outline of the paper is as follows. First, we introduce quantile regression. We have tried to limit the amount of technical material, keeping in mind the expected statistical level of our audience. We also show in this section how the required computations can be done relatively easily with the R system and the package quantreg
]. Then we apply quantile regression to a real data set, with an extremely high number of NDs. The paper ends with a short discussion.