Home | About | Journals | Submit | Contact Us | Français |

**|**Springer Open Choice**|**PMC5181651

Formats

Article sections

- Abstract
- Introduction
- Error Propagation in the Standard Euclidean Geometry
- Error Propagation on the Simplex
- Simulation-Based Investigations of Error Propagation
- Discussion and Conclusions
- References

Authors

Related links

Mathematical Geosciences

Math Geosci. 2016; 48(8): 941–961.

Published online 2016 July 7. doi: 10.1007/s11004-016-9646-x

PMCID: PMC5181651

Mehmet Can Mert, Phone: +43 1 58801 10560, Email: ta.ca.neiwut@trem.temhem, Email: ta.ca.neiwut@resomzlif.p.

Received 2015 August 27; Accepted 2016 June 15.

Copyright © The Author(s) 2016

Compositional data, as they typically appear in geochemistry in terms of concentrations of chemical elements in soil samples, need to be expressed in log-ratio coordinates before applying the traditional statistical tools if the relative structure of the data is of primary interest. There are different possibilities for this purpose, like centered log-ratio coefficients, or isometric log-ratio coordinates. In both the approaches, geometric means of the compositional parts are involved, and it is unclear how measurement errors or detection limit problems affect their presentation in coordinates. This problem is investigated theoretically by making use of the theory of error propagation. Due to certain limitations of this approach, the effect of error propagation is also studied by means of simulations. This allows to provide recommendations for practitioners on the amount of error and on the expected distortion of the results, depending on the purpose of the analysis.

Compositional data analysis is concerned with analyzing the relative
information between the variables, the so-called compositional parts, of a
multivariate data set. Here, relative information refers to the log-ratio
methodology (Aitchison 1986) and,
therefore, in fact, to an analysis of logarithms of ratios between the compositional
parts. It has been demonstrated that the sample space of compositions is not the
usual Euclidean space, but the simplex with the so-called Aitchison geometry
(Pawlowsky-Glahn et al. 2015). For a
composition ** x** = (

$$\begin{array}{c}\hfill {\mathcal{S}}^{D}=\{\mathit{x}=({x}_{1},\dots ,{x}_{D})\phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{such}\phantom{\rule{0.333333em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{that}\phantom{\rule{0.333333em}{0ex}}{x}_{j}>0\phantom{\rule{3.33333pt}{0ex}}\forall j,\phantom{\rule{1em}{0ex}}\sum _{j=1}^{D}{x}_{j}=\mathit{\kappa}\}\end{array}$$

for an arbitrary constant *κ*. Nevertheless, according to recent developments, the sample space
of compositional data is even more general (Pawlowsky-Glahn et al. 2015): A vector ** x** is a

$$\begin{array}{c}\hfill \mathcal{C}(\mathit{x})=\left(\frac{\mathit{\kappa}{x}_{1}}{{\sum}_{j=1}^{D}{x}_{j}},\dots ,\frac{\mathit{\kappa}{x}_{D}}{{\sum}_{j=1}^{D}{x}_{j}}\right),\end{array}$$

which, then, sum up to the constant *κ*.

The Aitchison geometry defines a vector space structure of the simplex
by the basic operations of perturbation and powering. Given two compositions
** x** = (

Powering refers to a multiplication of a composition ** x** = (

$$\begin{array}{c}\hfill \mathit{\alpha}\odot \mathit{x}=\mathcal{C}({x}_{1}^{\mathit{\alpha}},{x}_{2}^{\mathit{\alpha}},\dots ,{x}_{D}^{\mathit{\alpha}}).\end{array}$$

Furthermore, the Aitchison inner product, the Aitchison norm, and the
Aitchison distance have been defined, and they lead to a Euclidean vector space
structure (Pawlowsky-Glahn et al. 2015). All these definitions employ log-ratios between the
compositional parts; for instance, the Aitchison inner product between the
compositions ** x** and

$$\begin{array}{c}\hfill {\u27e8\mathit{x},\mathit{y}\u27e9}_{A}=\frac{1}{2D}\sum _{j=1}^{D}\sum _{k=1}^{D}ln\frac{{x}_{j}}{{x}_{k}}ln\frac{{y}_{j}}{{y}_{k}},\end{array}$$

that leads to the Aitchison norm and distance

$$\begin{array}{c}\hfill {||\mathit{x}||}_{A}=\sqrt{{\u27e8\mathit{x},\mathit{x}\u27e9}_{A}},\phantom{\rule{1em}{0ex}}{d}_{A}(\mathit{x},\mathit{y})=||\mathit{x}\oplus (-1)\odot {\mathit{y}||}_{A}\end{array}$$

respectively. Working directly in the simplex sample space is not straightforward. Rather, it is common to express compositional data in the usual Euclidean geometry. In the literature, one frequently refers to transformations; here, it is prefered to use the terminology of expressing the compositions in appropriate coordinates with respect to the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001) that allows to analyze compositions in the usual Euclidean geometry.

The focus in this paper is on isometric log-ratio (ilr) coordinates
(Egozcue et al. 2003), which allow to
express a composition ** x** ∈ 𝒮

$$\begin{array}{c}\hfill {z}_{j}={\mathrm{ilr}}_{j}(\mathit{x})=\sqrt{\frac{D-j}{D-j+1}}ln\frac{{x}_{j}}{\sqrt[D-j]{{\prod}_{k=j+1}^{D}{x}_{k}}},\phantom{\rule{1em}{0ex}}j=1,\dots ,D-1,\end{array}$$

1

and the coordinates ** z** = (

The definition of ilr coordinates (1) reveals that geometric means of (subsets of) the parts are
involved. Note that the geometric mean of ** x** can also be expressed as

$$\begin{array}{c}\hfill {g}_{m}(\mathit{x})={\left(\prod _{j=1}^{D}{x}_{j}\right)}^{1/D}=\phantom{\rule{0.333333em}{0ex}}\text{exp}\phantom{\rule{0.333333em}{0ex}}\left(\frac{1}{D}\sum _{j=1}^{D}ln{x}_{j}\right)\end{array}$$

involving the arithmetic mean of the log-transformed values. It is well known that the arithmetic mean is sensitive to data outliers (Maronna et al. 2006). Consequently, also data imprecision in one or some compositional parts (that are usually measured without respecting the relative nature of compositional data), or detection limit problems, may act like outliers and lead to a distortion of the geometric mean. The resulting ilr coordinates will suffer from data quality problems, and subsequent analyses based on these coordinates can be biased.

This unwanted effect is investigated here under the terminology of error propagation, where the effect of the errors on the output of a function is analyzed. Propagation of error can be performed by a calculus-based approach, or by simulation studies. A calculus-based approach makes use of the Taylor series expansion and calculates the first two statistical moments of the error of output, the mean and the variance, under the assumption that the errors are statistically independent (Ku 1966). With few exceptions, almost all analyses of error propagation with the calculus-based approach use the first-order Taylor approximation, and neglect the higher order terms (Birge 1939). This approach is briefly reviewed in Sect. 2. Section 3 starts with a motivating example about the effect of the errors on ilr coordinates and applies the concept of Taylor approximation to error propagation in the simplex. While this is done in a general form for any function (transformation), particular emphasis is given to error propagation for ilr coordinates that cause one source of distortion of outputs in practical geochemical problems (Filzmoser et al. 2009b).

Determining error propagation only for the first two moments is unsatisfactory, because it would also be interesting how the data structure is changed in the case of data problems like detection limits or imprecision of the measurements. Thus, simulation-based methods for error propagation are considered as well. The Monte Carlo method is adaptable and simple for the propagation of errors (Feller and Blaich 2001; Cox and Siebert 2006), and various applications of this method can be found (Liu 2008). The simulation-based approach in Sect. 4 makes use of a practical data set and shows the effect of imprecision and detection limit effects on the ilr coordinates. The interest lies particularly in error propagation on the first ilr coordinate, because this contains all relative information about the first compositional part, and on error propagation on all ilr coordinated jointly, because they contain the full multivariate information. The final Sect. 5 discusses the findings and concludes.

Consider a *p*-dimensional random
variable ** x** = (

$$\begin{array}{cc}\hfill y=& f({x}_{1},\dots ,{x}_{p})=f({\mathit{\mu}}_{1}+{\mathit{\u03f5}}_{1},\dots ,{\mathit{\mu}}_{p}+{\mathit{\u03f5}}_{p})\hfill \\ \hfill \approx & f({\mathit{\mu}}_{1},\dots ,{\mathit{\mu}}_{p})+\left[\frac{\mathit{\partial}f}{\mathit{\partial}{x}_{1}}({\mathit{\mu}}_{1})\right]{\mathit{\u03f5}}_{1}+\cdots +\left[\frac{\mathit{\partial}f}{\mathit{\partial}{x}_{p}}({\mathit{\mu}}_{p})\right]{\mathit{\u03f5}}_{p}.\hfill \end{array}$$

2

In the framework of error propagation, it is common to assume that
(*x*_{1}, …, *x*_{p}) follow a known distribution, in most cases, a multivariate normal
distribution (Ku 1966). If the
distribution is known, the partial derivatives are evaluated at the true means, if
not, the sample averages are used for the estimation. The approximation in Eq.
(2) can now be used to calculate mean and
variance of *y*, which both depend on the function
*f*. The second central moment, the variance
Var(*y*), describes the uncertainty, which is
mainly used to investigate the effect of error propagation and is given as

$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}(y)\approx \sum _{j=1}^{p}{\left(\frac{\mathit{\partial}f}{\mathit{\partial}{x}_{j}}({\mathit{\mu}}_{j})\right)}^{2}E({{\mathit{\u03f5}}_{j}}^{2})+\underset{j\ne k}{\sum \sum}\left(\frac{\mathit{\partial}f}{\mathit{\partial}{x}_{j}}({\mathit{\mu}}_{j})\right)\left(\frac{\mathit{\partial}f}{\mathit{\partial}{x}_{k}}({\mathit{\mu}}_{k})\right)E({\mathit{\u03f5}}_{j}{\mathit{\u03f5}}_{k}).\\ \hfill \end{array}$$

3

Equation (3) reveals how the
variability of the output *y* depends on the errors
and on the function *f*.

As a motivating example, the composition of sand, silt, and clay in
agricultural soils in Europe is considered. The data are reported in Reimann et al.
(2014). From the ternary diagram
(Fig. (Fig.1a),1a), it can be seen that the clay
concentrations can be very small, but data artifacts are not immediately visible.
The resulting ilr coordinates *z*_{1} and *z*_{2} are shown in Fig. Fig.1b.
Here,1b.
Here, the small clay values are visible in form of a band that deviates clearly from
the joint data structure. In fact, small values of clay have been rounded in the
laboratory, which causes already a distortion of the multivariate data structure.
Thus, the imprecision here is visible as a rounding effect in the part clay.
Variables with values below a detection limit can result in similar artifacts, since
usually the values below detection are set to some constant, like 2/3 times the
values of the detection limit (Martín-Fernández et al. 2003). This is still the usual practice in
geosciences rather than employing more sophisticated algorithms for their imputation
(Martín-Fernández et al. 2012).

Composition of sand, silt, and clay in agricultural soils of
Europe. Ternary diagram **a**, representation
in ilr coordinates **b**

Similar as in Sect. 2, error propagation is derived for a general function using first-order Taylor approximation. However, since this is directly done on the simplex, also the Taylor approximation needs to be done on the simplex. The theoretical background for the differential calculus on the simplex can be found in Barceló-Vidal and Martín-Fernández (2002) and Barceló-Vidal et al. (2011). Here, the tools necessary to carry out the Taylor approximation are recalled.

Let *f*:*U* → ℝ^{m} be a vector-valued function defined on a subset $U\subset {\mathbb{R}}_{+}^{D}$. Let $\underline{U}=\{\mathcal{C}(\mathit{w}),\mathit{w}\in U\}$, the compositional closure of *U*, be a subset of 𝒮^{D}. If *f* is scale invariant, that
is *f*(** w**) =

$$\begin{array}{c}\hfill \underline{f}(\mathit{x})=f(\mathit{w}),\phantom{\rule{1em}{0ex}}\forall \mathit{w}\in U\text{,}\phantom{\rule{0.333333em}{0ex}}\end{array}$$

where 𝒞(** w**) =

$$\begin{array}{c}\hfill \underset{\mathit{u}\stackrel{\mathcal{C}}{\to}\mathit{n}}{lim}\phantom{\rule{4pt}{0ex}}\frac{\Vert \underline{f}(\mathit{x}\oplus \mathit{u})-\underline{f}(\mathit{x})-\mathit{A}\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{ln}\phantom{\rule{0.333333em}{0ex}}\mathit{u}\Vert}{{\Vert \mathit{u}\Vert}_{A}}=0\end{array}$$

for $\mathit{u}\in \underline{U}$, where 1_{D} = (1, …, 1) with length *D*, and
0_{m} = (0, …, 0) with length *m*. Note that
** n** = 𝒞(1, …, 1) is the neutral element of (𝒮

$$\begin{array}{c}\hfill \underline{f}(\mathit{x}\oplus \mathit{u})\approx \underline{f}(\mathit{x})+\sum _{j=1}^{D}\phantom{\rule{0.333333em}{0ex}}\text{ln}\phantom{\rule{0.333333em}{0ex}}({u}_{j})\left[\frac{{\mathit{\partial}}_{\mathcal{C}}\underline{f}}{\mathit{\partial}{x}_{j}}(\mathit{x})\right],\end{array}$$

4

where the 𝒞-derivative of $\underline{f}$ exists and is equal to

$$\begin{array}{c}\hfill \frac{{\mathit{\partial}}_{\mathcal{C}}\underline{f}}{\mathit{\partial}{x}_{j}}(\mathit{x})={x}_{j}\left(\frac{\mathit{\partial}\underline{f}}{\mathit{\partial}{x}_{j}}(\mathit{x})-\sum _{i=1}^{D}{x}_{i}\frac{\mathit{\partial}\underline{f}}{\mathit{\partial}{x}_{i}}(\mathit{x})\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{for}\phantom{\rule{0.333333em}{0ex}}j=1,\dots ,D.\end{array}$$

5

Given a *D*-part composition
** x** = (

$$\begin{array}{c}\hfill \underline{f}(\mathit{\mu}\oplus \mathit{\u03f5})\approx \underline{f}(\mathit{\mu})+\sum _{j=1}^{D}\phantom{\rule{0.333333em}{0ex}}\text{ln}\phantom{\rule{0.333333em}{0ex}}({\mathit{\u03f5}}_{j})\left[\frac{{\mathit{\partial}}_{\mathcal{C}}\underline{f}}{\mathit{\partial}{\mathit{\mu}}_{j}}(\mathit{\mu})\right].\end{array}$$

6

One can proceed as in Sect. 2 to derive the variance of the components of $\underline{f}(\mathit{\mu}\oplus \mathit{\u03f5})$. Similar as for the Taylor expansion (2) from Sect. 2, also here, the approximation is valid just for small perturbations. Moreover, in contrast to the previous case, the error is now multiplicative. Although this fits well with the nature of compositional data, particularly with their scale invariance, in practice, error terms are often additive (van den Boogaart et al. 2015). This fact should be considered for an error propagation analysis of compositional data.

In the case of ilr coordinates, however, the investigation of the
error propagation simplifies. By considering (6) with ilr coordinate *i**l**r*_{i}(** x**) as

$$\begin{array}{c}\hfill \frac{{\mathit{\partial}}_{c}{\mathrm{ilr}}_{i}}{\mathit{\partial}{\mathit{\mu}}_{j}}=\left\{\begin{array}{cc}0\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}j<i,\hfill \\ \sqrt{\frac{D-i}{D-i+1}}\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}j=i,\hfill \\ -\sqrt{\frac{D-i}{D-i+1}}\frac{1}{D-i}\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}j>i,\hfill \end{array}\right.\end{array}$$

7

where *i* = 1, …, *D* - 1. This corresponds exactly to a logcontrast (Aitchison 1986) of the *i*
th ilr coordinate of the compositional error ** ϵ**, and consequently

ilr_{i}(*x*) = ilr_{i}(*μ* ⊕ *ϵ*) = ilr_{i}(*μ*) + ilr_{i}(*ϵ*), *i* = 1, …, *D* - 1.

In the context of error propagation this shows that the ilr coordinates are additive with respect to multiplicative errors. On the other hand, for other forms of errors, a non-linear behavior can be expected. This issue is further investigated within the simulation study in Sect. 4.

In addition, this leads to an alternative verification of the linearity of ilr coordinates

that is commonly shown directly with the definitions from Sect.
1. Even more, ilr coordinates represent an
isometry, which means that all metric concepts in the simplex are maintained after
taking the ilr coordinates (Pawlowsky-Glahn et al. 2015). The variance can now be considered component-wise, for
example for the *j* th component *z*_{j} of ** z** one obtains

Var (*z*_{j}) = Var (ilr_{j}(*x*)) = Var (ilr_{j}(*ϵ*)).

This variance can be expressed by log-ratios of the compositional parts, as shown in Fišerová and Hron (2011) as

$$\begin{array}{cc}\hfill \phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}({z}_{j})=& A-B\phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{with}\phantom{\rule{0.333333em}{0ex}}\hfill \\ \hfill A=& \frac{1}{D-j+1}\sum _{k=j+1}^{D}\phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}\left(ln\frac{{\mathit{\u03f5}}_{j}}{{\mathit{\u03f5}}_{k}}\right),\hfill \\ \hfill B=& \frac{1}{2(D-j)(D-j+1)}\sum _{k=j+1}^{D}\sum _{l=j+1}^{D}\phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}\left(ln\frac{{\mathit{\u03f5}}_{k}}{{\mathit{\u03f5}}_{l}}\right).\hfill \end{array}$$

8

The contributions of log-ratio variances in this linear combination are
clearly higher for terms in *A* that include
*ϵ*_{j}, and lower for terms in *B* where
*ϵ*_{j} is not involved, and their magnitude depends on the number of
parts *D*. In particular, if *D* is large and contamination (imprecision, detection
limit problem) is expected only in one compositional part, the effect on the
variance of *z*_{j} will be small. Note, however, that for a multivariate analysis,
the focus is in all coordinates *z*_{1}, …, *z*_{D-1} simultaneously, and thus, it is not so straightforward to
investigate the effect, since there may also be dependencies among the error terms.
There is a simple exception: suppose that an error is to be expected only in
log-ratios with one compositional part. From a practical perspective, it would then
appear that only one compositional part is erroneous. If this part is taken as the
first one, the ilr coordinates from Eq. (1)
will allow to assign this error exclusively to *z*_{1}, but not to the other coordinates.

Besides investigating the variance of the coordinates, it is also important to know how the errors affect distances between different compositions, that is between observations of a compositional data set, and how the multivariate data structure is affected. All these aspects will be investigated in more detail by simulations in the next section.

For a simulation-based analysis of error propagation, a real data set
is used, namely the GEMAS data mentioned in Sect. 1, described in Reimann et al. (2014). More than 2000 samples of agricultural soils have been
analyzed in an area covering 5.6 million km^{2} of Europe across 33 countries, and for the simulations, the
concentrations of the elements Al, Ba, Ca, Cr, Fe, K, Mg, Mn, Na, Nb, P, Pb, Rb, Si,
Sr, Ti, V, Y, Zn, and Zr are considered. Precision or detection limit problems of
these elements are rather small or even not existing (Reimann et al. 2014), and thus, these elements form a good base
for carrying out simulations where contamination is artificially introduced in the
form of imprecision and detection limit problems.

Denote the resulting compositional data matrix by ** X**, where the observations are forming the rows and the
above-mentioned compositional parts the columns. The number of observations is

In the simulations, problems with detection limit and imprecision are reproduced as follows:

*Detection limit*(*DL*) Set all observations*x*_{ij}of the*j*th part to the valuewhere$$\begin{array}{c}\hfill {x}_{ij}^{\ast}=\left\{\begin{array}{cc}\frac{2}{3}\phantom{\rule{0.333333em}{0ex}}\text{DL}{\phantom{\rule{0.333333em}{0ex}}}_{j}\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{if}\phantom{\rule{0.333333em}{0ex}}{x}_{ij}\le \phantom{\rule{0.333333em}{0ex}}\text{DL}{\phantom{\rule{0.333333em}{0ex}}}_{j}\hfill \\ {x}_{ij}\hfill & \phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{otherwise}\phantom{\rule{0.333333em}{0ex}},\hfill \end{array}\right.\end{array}$$9*i*= 1, …,*n*, and DL_{j}is taken as some quantile of that part.*Imprecision rate*(*IR*) A noise term*ϵ*_{ij}is added to each observation*x*_{ij}, where the noise depends on the magnitude of the observation and follows a uniform distribution. Thus, the values*x*_{ij},*i*= 1, …,*n*, are set towhere$$\begin{array}{c}\hfill {x}_{ij}^{\ast}={x}_{ij}+{\mathit{\u03f5}}_{ij},\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}{\mathit{\u03f5}}_{ij}\sim \mathcal{U}(-{\mathit{\alpha}}_{j}{x}_{ij},{\mathit{\alpha}}_{j}{x}_{ij}),\end{array}$$10*α*_{j}> 0 defines the imprecision rate of the*j*th part, and the resulting simulated value ${x}_{ij}^{\ast}$ must be positive. Note that this contamination is not additive but multiplicative, sinceThus, this contamination scheme corresponds to the error model of the previous section, while contamination by a detection limit introduces a non-linear effect.$$\begin{array}{c}\hfill {x}_{ij}^{\ast}={x}_{ij}(1+{\mathit{\gamma}}_{j}),\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}{\mathit{\gamma}}_{j}\sim \mathcal{U}(-{\mathit{\alpha}}_{j},{\mathit{\alpha}}_{j}).\end{array}$$

As mentioned previously, the main interest is the investigation of
error propagation for ilr coordinates. If the *i*
th row of ** X** is denoted by

As an illustrative example the last ten parts of the composition are
picked, and contaminated with errors. A detection limit problem is imitated, by
choosing DL _{j} as the 0.25-quantile in each of these components, and setting the
values in these parts according to Eq. (9).
The results are shown in the left panels of Fig. Fig.2:2: the upper panel shows the first ilr coordinate of the original
versus the contaminated data. One can see clear distortions in form of deviations
from the main structure, but also in the form of nonlinearities. For a clearer
picture of the multivariate data structure, the Mahalanobis distances of all ilr
coordinates for the original and contaminated data are presented in the lower panel
of Fig. Fig.2.2. The Mahalanobis distance (MD) for
the *i* th composition expressed in coordinates
is

$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}({\mathit{z}}_{i})=\sqrt{{({\mathit{z}}_{i}-{\mathit{t}}_{\mathit{z}})}^{\prime}{\mathit{C}}_{\mathit{z}}^{-1}({\mathit{z}}_{i}-{\mathit{t}}_{\mathit{z}})},\phantom{\rule{1em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{for}\phantom{\rule{0.333333em}{0ex}}i=1,\dots ,n,\end{array}$$

11

where *t*_{z} and *C*_{z} are robust estimators of location and covariance of the ilr
coordinates ** Z**, respectively. For reasons of comparability, the Mahalanobis
distances for the contaminated data are computed with the estimators

Effect of the DL and IR contamination on the first ilr coordinate
(**a** and **b**), and on all ilr coordinates jointly (**c** and **d**)

The right panel of Fig. Fig.2
shows2
shows the results of a simulated precision problem. Again, the last ten parts are
contaminated, *α*_{j} is set to 0.25 for these parts, and Eq. (10) is applied. The upper panel compares the first ilr
coordinates for the original and distorted data. Since the contamination is
symmetric in each part, the outcome is also relatively symmetric around the line of
45 degrees. The comparison of the Mahalanobis distance shows that those distances
for the contaminated data increase, in general.

The above example already provides an idea about possible choices of measures for quantifying the resulting error. The focus is on the first ilr coordinate as well as on all coordinates jointly in terms of Mahalanobis distances, and the original data will be compared with the contaminated data.

Denote the values of the first ilr coordinate by *z*_{0.1} = (*z*_{11}, …, *z*_{n1}), and the corresponding contaminated version by ${\mathit{z}}_{0.1}^{\ast}=({z}_{11}^{\ast},\dots ,{z}_{n1}^{\ast})$. The two vectors are compared by:

- Spearman rank correlation, expressed aswhere$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}\text{Cor}{\phantom{\rule{0.333333em}{0ex}}}_{S}({\mathit{z}}_{0.1},{\mathit{z}}_{0.1}^{\ast})=\frac{\phantom{\rule{0.333333em}{0ex}}\text{Cov}\phantom{\rule{0.333333em}{0ex}}(R({\mathit{z}}_{0.1}),R({\mathit{z}}_{0.1}^{\ast}))}{\sqrt{\phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}(R({\mathit{z}}_{0.1}))}\sqrt{\phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}(R({\mathit{z}}_{0.1}^{\ast}))}},\end{array}$$12
*R*( · ) gives the ranks of its argument vector. - Mean absolute scaled deviation (MASD), defined as$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}\text{MASD}\phantom{\rule{0.333333em}{0ex}}({\mathit{z}}_{0.1},{\mathit{z}}_{0.1}^{\ast})=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|{z}_{i1}-{z}_{i1}^{\ast}\right|}{\sqrt{\phantom{\rule{0.333333em}{0ex}}\text{Var}\phantom{\rule{0.333333em}{0ex}}({\mathit{z}}_{0.1})}}.\end{array}$$13

The Spearman rank correlation coefficient measures the monotone relation between the uncontaminated and contaminated coordinates; a value of one would refer to the same ordering of the values of the coordinates. On the other hand, MASD is more strict and evaluates the error in reproducing the values of the coordinate. Note that the scaling in MASD by the variance is used to allow for a comparison of the corresponding first ilr coordinates if the parts in the data matrix are permuted.

Similar measures for comparison are proposed in the multivariate
case. Denote by MD (** Z**), the vector of the Mahalanobis distances MD (

$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}\text{MASD}\phantom{\rule{0.333333em}{0ex}}(\phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}(\mathit{Z}),\phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}({\mathit{Z}}^{\mathbf{\ast}}))=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|\phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}({\mathit{z}}_{i})-\phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}({\mathit{z}}_{i}^{\ast})\right|}{{Q}_{0.5}(\phantom{\rule{0.333333em}{0ex}}\text{MD}\phantom{\rule{0.333333em}{0ex}}(\mathit{Z}))}.\end{array}$$

14

The scaling is done by the 0.5 quantile (median) of the Mahalanobis
distances of ** Z** to allow for comparability of subcompositions with different
numbers of parts. This measure, thus, indicates the error in reproducing the
multivariate data structure. As mentioned previously, the Mahalanobis distances
MD (

These measures have been computed for the example shown in Fig. Fig.22 to get an idea about the meaning of the magnitude of these values. The Spearman rank correlation is in all cases clearly above 0.9, in spite of the deviations of some points. The scaled distances MASD for the first ilr coordinates are lower that those for all coordinates jointly (Mahalanobis distances).

Start with the first column *x*_{0.1} of the composition ** X**, and add step-by-step another column. After the (

The number of simulation replications is 100. In each replication,
the parts of the original composition are permuted. In that way, the first
(uncontaminated) part changes, but also the sequence of the parts that are added
changes. All simulations are done for the contamination in the form of detection
limit (DL) and for imprecision (IR). In the first case, the value
DL_{j} of the detection limit is taken as the 0.25 quantile, see Eq.
(9), while in the latter case, the
imprecision rate is taken as *α*_{j} = 0.25, see Eq. (10).

The results are presented by boxplots in Fig. Fig.3.3. The left panels show the outcome for the detection limit simulations, and the right panels show the results of the imprecision simulations. The upper figures show the comparison of original versus contaminated versions in terms of Spearman correlations, while the lower figures compare in terms of MASD. The grey boxplots compare the first ilr coordinates, while the white boxplots summarize the Mahalanobis distances of all joint coordinates. The plots allow to compare the impact of an increasing number of contaminated parts (horizontal axis). Although the amount of contamination is quite high, the correlations reveal that the covariance structure of the multivariate data is basically preserved. In particular, the comparison of the first ilr coordinates leads to a remarkably high correlation, which is quite stable with an increasing number of parts (for DL), and even improving in the case of IR. This means that additional parts coupled with a symmetric contamination scheme, as in the case of IR, still provide important and useful information that stabilizes the first ilr coordinate. The MASD results for the first ilr coordinate are quite stable in the case of DL, while in the IR case with increasing number of parts an improvement is observed.

One uncontaminated part, and 1–19 contaminated parts added.
Univariate and multivariate structural changes between the original and
contaminated ilr coordinates with increasing number of contaminated parts
in case of DL (**a** and **c**) and IR (**b** and
**...**

The picture is somewhat different when comparing all ilr coordinates jointly. The Spearman correlation is clearly lower, and it gets more stable with an increasing number of parts. In the case of DL, the MASD measure is nearly constant with an increasing number of parts, while for IR first, a decline is observed, but then a clear increase. It is, however, surprising that the Mahalanobis distances do not change more drastically, given that the amount of contamination is relatively high.

In a further simulation experiment, a block of ten compositional parts is fixed and left uncontaminated. Step-by-step, a contaminated part is added, until all ten remaining (contaminated) parts have been included. The comparison is done in the same way as before. The simulation is repeated 100 times, and the parts are randomly permuted for each replication. Thus, the uncontaminated block changes, but also the contaminated parts differ from simulation to simulation. The results are shown in Fig. Fig.44.

Ten uncontaminated parts, and 1 to 10 contaminated parts added.
Univariate and multivariate structural changes between the original and
contaminated ilr coordinates with increasing number of contaminated parts
in the case of DL (**a** and **c**) and IR (**b** and
**...**

Basically, a similar impression can be observed as in Fig. Fig.3.3. For the first ilr coordinates, the correlations are now very close to one, and the values of MASD, although increasing slightly with increasing number of contaminated parts, are close to zero. Therefore, having good data quality for a major part of the data set is a good protection against poor data quality in additional parts—at least for the first ilr coordinate. The multivariate data structure is well maintained in terms of ordering, expressed by the Spearman rank correlations, which are still clearly above 0.9. The MASD values for the Mahalanobis distances now increase for DL as well as for IR, with an increasing number of parts, but they are lower than in the previous simulation.

In the previous simulations, the amount of contamination is fixed.
Here, the effect of changing the amount of contamination is investigated. For that
purpose, ten parts are selected randomly to leave them uncontaminated, while the
remaining ten parts are contaminated by the same amount: in the case of DL
contamination, the value DL_{j} is varied from the 0.05-quantile to the 0.95-quantile; for IR
contamination, the imprecision rate *α*_{j} is varied from 0.05 to 0.95. Note that the imprecision in real
studies can be much higher, in particular for small concentrations (Reimann et al.
2014). Fig. 5 summarizes the outcome of the simulations, where
again 100 replications were performed.

Ten uncontaminated parts, and 10 contaminated parts added.
Univariate and multivariate structural changes between the original and
contaminated ilr coordinates with increasing amount of contamination in
case of DL (**a** and **c**) and IR (**b** and **d**)

The resistance against contamination of the first ilr coordinate is remarkable. Both the correlation and the MASD report relatively small deviations, even for very high amounts of contamination. Contamination according to DL has more effect than that based on imprecision. This is different when looking at the multivariate data structure, expressed by the joint coordinates. The correlations get severely low, and also the MASD increases rapidly. The effect for IR contamination is more severe than that for DL. A MASD value of one means that the average change of the Mahalanobis distances before and after contamination is as large as the median Mahalanobis distance, and thus, this would correspond to a substantial change in the multivariate data structure.

In a final simulation, the effect of the number of observations in
the data set, which has been fixed before with all available observations (i.e.,
more than 2000), is analyzed. As before, ten parts are randomly selected and not
modified, and the remaining ten parts are contaminated at a level of 25 %, that is
for DL contamination 25 % of values below detection limit in each of these parts,
and for IR contamination *α*_{j} = 0.25 for these parts. The results in Fig. Fig.66 for the 100 simulations show that there is no visible effect
for the first ilr coordinate. However, the multivariate structure suffers severely
if the number of observations is smaller than 100.

To many practitioners, it looks almost obvious that geometric means, as they are used in log-ratio approaches, may cause instabilities due to the involved products of the data values. Even worse, measurement errors could be propagated by the use of geometric means. This problem is investigated in more detail, by focusing on the most important log-ratio approach based on ilr coordinates (Pawlowsky-Glahn and Buccianti 2011; Pawlowsky-Glahn et al. 2015).

In a first attempt, the classical theory of error propagation has
been formulated on the simplex, the sample space of compositional data. While this
gets complex if any transformation function would be considered, the results are
straightforward when using ilr coordinates because of their linearity. It has been
shown that the variance of an ilr coordinate is just the variance of the same ilr
coordinate of the random deviations from the center. Using Eq. (8), it can be seen which terms contribute by which
magnitude to this variance. For non-linear contamination schemes, these variance
contributions cannot be computed from the random errors, but they have to be
computed directly from the ilr coordinate. This has been done for the simulation
scheme outlined in Sect. 4.1 for the first
ilr coordinates *z*_{1} of the uncontaminated data, the data contaminated by a detection
limit, and contaminated by the imprecision rate. The resulting variance
contributions are shown in Fig. Fig.77 in the
form of ratios *A* / *B* according to Eq. (8) as
non-colored boxplots. With increasing number of parts, the term *B* (which does not involve variance contributions with
log-ratios to *x*_{1}) gets more dominant. This can be seen in the uncontaminated case,
as well as in the contaminated cases due to the inherent variability contained
within the log-ratios of the remaining parts. Interestingly, detection limit
contamination has almost no effect on the variance contributions *A* and *B* when compared
to the uncontaminated case. This is also shown by the dark boxplots which represent
the ratios of *A*-contaminated to *A*-uncontaminated. Only for contamination by the
imprecision rate, the variance contributions are clearly higher compared to the
uncontaminated case if the number of contaminated parts is low. For higher numbers
of contaminated parts, the variance contributions are about the same.

Further investigations have been carried out through simulation experiments. The contamination is studied in terms of mimicking a detection limit problem, and in terms of imprecision in form of a multiplicative factor. In all experiments it turned out that the structure of the first ilr coordinate can almost not be destroyed with poor data quality, except in the case of extremely high amounts of contamination. This is an interesting outcome, since due to the proposed formula (1) to derive the ilr coordinates, the first coordinates describes all relative information about the first compositional part (Fišerová and Hron 2011). Clearly, if the main interest is not in the first, but in another part, then this part is simply put to the first position. Note that the first coordinate is proportional to the corresponding centered log-ratio (clr) coefficient (Aitchison 1986) for this part (Fišerová and Hron 2011). Practitioners often explore just the structure of the resulting clr coefficients. For example, one can study the clr coefficients for the different chemical elements in maps, which is the compositional alternative to the traditional maps based on the absolute concentrations. Examples are shown in Reimann et al. (2014).

It is not studied, how the contamination of the first part effects
the first ilr coordinate (*z*_{1}), because it is clear that the contamination would be immediately
reflected in the first ilr coordinate, and any additional contamination in other
parts would make things worse. Hence, variations of *z*_{1} are only due to variations of (*x*_{2}, …, *x*_{D}). It is, therefore, quite logical that the impact of DL or IR on
*z*_{1} remains limited, and that its growth decreases as *D* increases, due to compensation effects when computing
*g*_{m}(*x*_{2}, …*x*_{D}).

Especially when applying multivariate statistical methods, such as principal component analysis or discriminant analysis, all ilr coordinates have to be analyzed jointly. Therefore, the effect of errors on the multivariate data structure is also investigated in the simulations. It depends very much on the setting if the multivariate data structure is destroyed by the contamination or not. If dimension increases, the effects of the contamination generally increase. It depends a lot on the contamination level if the multivariate data structure after contamination is still closely related to that before, but this also depends on the sample size of the data: higher numbers of observation (e.g., at least 100 in the data set used here) stabilize the results.

Consider again the example shown in Fig. Fig.2,2, where 10 parts out of 20 have been contaminated at a level of 25 %. Here, the DL contamination scheme is considered. Figure Figure88 shows the biplots for the first two principal components (PCs): left panels for the uncontaminated data, right panels for the contaminated data. A comparison is also done with robust PCs (Filzmoser et al. 2009a), which are shown at the lower panels. While there is almost no difference visible between the uncontaminated and contaminated versions, there is a clear difference in the outcome of classical and robust principal component analysis. This shows that, although the MASD is around 0.18 (Fig. (Fig.2c),2c), the outliers that are present in the data have a much stronger effect than the artificial contamination used here.

The overall conclusion of this paper is not that one does not have to care anymore about data quality issues. In contrary, good data quality is the basis of any sound statistical analysis. Rather, it should provide an answer to researchers who have a data set available, and who carefully think about which compositional parts to include in the analysis. Often, it is known which parts have precision problems, and sometimes even the level of imprecision is known. In addition, the amount of values below detection is known. Including such parts with moderate quality in the analysis will in general not have a major effect on a single (the first) ilr variable, and the effects will also be limited, in general, for the multivariate data structure.

The point why one should consider including as much information as possible in the analysis is because the reliable values of such parts with moderate data quality also contribute to the log-ratio analysis, and they might contain important and relevant information.

This work has been partly funded by the Austrian Science Fund (FWF), Project I 1910-N26, by the Grant COST Action CRoNoS IC1408, and by the K-project DEXHELPP through COMET—Competence Centers for Excellent Technologies, supported by BMVIT, BMWFI, and the province Vienna. The COMET program is administrated by FFG.

Mehmet Can Mert, Phone: +43 1 58801 10560, Email: ta.ca.neiwut@trem.temhem, Email: ta.ca.neiwut@resomzlif.p.

Karel Hron, Phone: +42 0 585 634 605, Email: zc.manzes@knorh.

- Aitchison J. The statistical analysis of compositional data. London: Chapman & Hall; 1986.
- Barceló-Vidal C, Martín-Fernández JA. Differential calculus on the simplex. Terra Nostra. 2002;3:393–398.
- Barceló-Vidal C, Martín-Fernández JA, Mateu-Figueras G. Compositional differential calculus on the simplex. In: Pawlowsky-Glahn V, Buccianti A, editors. Compositional data analysis: theory and applications. Chichester: Wiley; 2011. pp. 176–190.
- Birge RT. The propagation of errors. Am J Phys. 1939;7(6):351–357. doi: 10.1119/1.1991484. [Cross Ref]
- Cox MG, Siebert BRL. The use of a Monte Carlo method for evaluating uncertainty and expanded uncertainty. Metrologia. 2006;43(4):S178. doi: 10.1088/0026-1394/43/4/S03. [Cross Ref]
- Egozcue JJ. Reply to “On the Harker variation diagrams;..” by J. A. Cortés. Math Geosci. 2009;41(7):829–834. doi: 10.1007/s11004-009-9238-0. [Cross Ref]
- Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C. Isometric logratio transformations for compositional data analysis. Math Geol. 2003;35:279–300. doi: 10.1023/A:1023818214614. [Cross Ref]
- Feller SE, Blaich CF. Error estimates for fitted parameters: application to hcl/dcl vibrational-rotational spectroscopy. J Chem Educ. 2001;78(3):409. doi: 10.1021/ed078p409. [Cross Ref]
- Filzmoser P, Hron K, Reimann C. Principal component analysis for compositional data with outliers. Environmetrics. 2009;20:621–632. doi: 10.1002/env.966. [Cross Ref]
- Filzmoser P, Hron K, Reimann C. Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ. 2009;407:6100–6108. doi: 10.1016/j.scitotenv.2009.08.008. [PubMed] [Cross Ref]
- Fišerová E, Hron K. On interpretation of orthonormal coordinates for compositional data. Math Geosci. 2011;43(4):455–468. doi: 10.1007/s11004-011-9333-x. [Cross Ref]
- Ku HH (1966) Notes on the use of propagation of error formulas. J Res Natl Bureau Stand 70
- Liu JS (2008) Monte Carlo Strategies in Scientific Computing. Springer, New York
- Maronna R, Martin D, Yohai V. Robust statistics: theory and methods. Toronto: John Wiley & Sons Canada Ltd.; 2006.
- Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol. 2003;35(3):253–278. doi: 10.1023/A:1023866030544. [Cross Ref]
- Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J. Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal. 2012;56(9):2688–2704. doi: 10.1016/j.csda.2012.02.012. [Cross Ref]
- Pawlowsky-Glahn V, Buccianti A. Compositional data analysis: theory and applications. Chichester: Wiley; 2011.
- Pawlowsky-Glahn V, Egozcue JJ. Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess. 2001;15(5):384–398. doi: 10.1007/s004770100077. [Cross Ref]
- Pawlowsky-Glahn V, Egozcue JJ. Blu estimators and compositional data. Math Geol. 2002;34(3):259–274. doi: 10.1023/A:1014890722372. [Cross Ref]
- Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. Modeling and analysis of compositional data. Chichester: Wiley; 2015.
- Reimann C, Birke M, Demetriades A, Filzmoser P, O’Connor P. Chemistry of Europe’s agricultural soils—part A: methodology and interpretation of the GEMAS data set. Hannover: Schweizerbarth; 2014.
- Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics. 1999;41(3):212–223. doi: 10.1080/00401706.1999.10485670. [Cross Ref]
- van den Boogaart K, Tolosana-Delgado R, Templ M. Regression with compositional response having unobserved components or below detection limit values. Stat Model. 2015;15(2):191–213. doi: 10.1177/1471082X14535527. [Cross Ref]

Articles from Springer Open Choice are provided here courtesy of **Springer**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |