Classically, complexity is defined as the amount of information required to specify the contents of a message [84]. A historical review is given in Li and Vitányi [161, Section 1.6]. This definition can be operationalized by building an instruction set that can generate the message. The complexity of the message is defined to be the length of the instruction set. This operationalization is implemented in the context-free grammar complexity [60]. A systematic procedure (outlined below) is used to construct an algorithm that can reconstruct the original message. The size of the algorithm (also defined below) is the complexity of the message. It is understood that this gives an upper bound on complexity. It is always possible that an alternative construction will give a smaller instruction set. This is true of all complexity measures in this category (randomness-finding, nonprobabilistic, model-based measures [73]). Because the procedure used to construct the algorithm is systematic, complexity is valid as a comparative measure. This consideration also indicates why it is useful to have the results of a complexity analysis confirmed by the application of a second measure. "There is no single value of complexity. These calculations provide a systematic procedure for obtaining an empirical measure of dynamical behavior that can be compared across conditions." [163]
The procedure for determining the context-free grammar complexity is best introduced by considering a specific example. Consider the previously introduced message M2.
The procedure begins with a search for repeated pairs. In this message, AD is the most frequently repeated pair. It is replaced by the new symbol α = AD.
BC is the next most frequently repeated pair. It is replaced by the symbol β = BC.
BB is repeated twice but, as will be seen, replacing a pair of symbols with a new symbol does not decrease the size of the instruction set if the pair is repeated only twice. The search for repeated pairs therefore ends, and the search for repeated triples begins. The triple BBD is repeated twice. In the case of triples, replacing a repeated triple does decrease the size of the instruction set even if it is repeated only twice. BBD is replaced by the symbol γ = BBD.
There are no other repeated triples. In the general case, the search for repeated triples is followed by a sequential search for repeated n-tuples, n = 4, 5, 6, ..., until the search is exhausted. In the case of this message there are no higher order repeats. The compression has converged.
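The substitution search can be summarized in code. The following Python sketch is illustrative rather than the implementation used here: it assumes greedy left-to-right replacement, uses overlapping counts as an approximation when identifying the most frequent n-tuple, applies the repetition thresholds from the worked example (pairs must repeat at least three times, longer tuples at least twice), and uses placeholder symbol names a0, a1, ... in place of the Greek letters above.

```python
from collections import Counter
from itertools import count

def most_common_ntuple(seq, n):
    """Most frequent length-n tuple in seq (overlapping count)."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return counts.most_common(1)[0] if counts else (None, 0)

def grammar_compress(message):
    """Greedy repeated n-tuple substitution: exhaust repeated pairs,
    then triples, then n = 4, 5, 6, ... until no useful repeat remains."""
    seq = list(message)
    rules = {}
    fresh = (f"a{i}" for i in count())
    n = 2
    while n <= len(seq) // 2:
        tup, k = most_common_ntuple(seq, n)
        # pairs only help if repeated at least three times; longer
        # tuples help even when repeated only twice (see text)
        if k >= (3 if n == 2 else 2):
            sym = next(fresh)
            rules[sym] = tup
            out, i = [], 0
            while i < len(seq):          # replace non-overlapping matches
                if tuple(seq[i:i + n]) == tup:
                    out.append(sym)
                    i += n
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        else:
            n += 1                       # move on to longer tuples
    return seq, rules

# Hypothetical message: AD occurs three times and is replaced, while
# BC occurs only twice and is not, consistent with the thresholds above:
# grammar_compress("ADADADBCBC")
#   -> (['a0', 'a0', 'a0', 'B', 'C', 'B', 'C'], {'a0': ('A', 'D')})
```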
In the next step product terms are replaced by exponentials. Thus AA is replaced by A^2, CCC is replaced by C^3, and BB is replaced by B^2. The instruction set to reconstruct the original message is:
Complexity is determined by calculating the size of this instruction set. Under the definition used here [60], each symbol adds one to the complexity, and an exponent n contributes log2(n). The total is 27.585. Again under this definition, the integer part of the final sum is reported in bits. The context-free grammar complexity of M2 is 27 bits.
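The size calculation is then mechanical. In the following sketch, an instruction set is assumed to be represented as rules whose terms are (symbol, exponent) pairs; how rule separators are counted is an accounting detail of the representation, so the usage line evaluates only an illustrative fragment, not the full instruction set for M2.

```python
from math import log2

def instruction_set_size(rules):
    """Size of an instruction set: each symbol contributes 1 and an
    exponent n contributes log2(n).  Each rule is a list of
    (symbol, exponent) terms; exponent 1 means no exponent is written."""
    total = 0.0
    for terms in rules:
        for symbol, exponent in terms:
            total += 1                   # every symbol counts one
            if exponent > 1:
                total += log2(exponent)  # exponents add log2(n)
    return total

# Illustrative fragment only: A^2 B^2 C^3 contributes
# 3 + log2(2) + log2(2) + log2(3) = 6.585
print(round(instruction_set_size([[("A", 2), ("B", 2), ("C", 3)]]), 3))
```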
Estimating the uncertainty in the complexity of a specific message is problematic when the message is considered in isolation and a large population of messages generated by the identical dynamical process is not available. In Rapp, et al. [164] the uncertainty in C_ORIG is approximated by finding the difference in complexity values obtained in the first half and the second half of the message and expressing this difference as a fraction of their average value. Let C_A be the complexity of the first half of the message, and let C_B be the complexity of the second half of the message. Under this approximation, the uncertainty in C_ORIG is given by

$$\Delta C_{ORIG} = C_{ORIG}\,\frac{\left| C_A - C_B \right|}{\left( C_A + C_B \right)/2}$$

where we use the property that C_A and C_B are positive. This procedure can give an aberrant value of zero when C_A = C_B. An alternative procedure for estimating ΔC_ORIG can be constructed by calculating ⟨C_1/2⟩, the mean value of complexity calculated from all possible substrings of length L_M/2, and σ_1/2, the standard deviation of that mean. Expressed as a fraction, the uncertainty is σ_1/2/⟨C_1/2⟩, and ΔC_ORIG is given by

$$\Delta C_{ORIG} = C_{ORIG}\,\frac{\sigma_{1/2}}{\langle C_{1/2} \rangle}$$
This procedure is, however, computationally insupportable for longer messages. Suppose L_M = 8000. This procedure for estimating ⟨C_1/2⟩ would require averaging 4000 values of complexity calculated from strings of length L_M/2 = 4000. We have adopted the procedure of calculating ⟨C_1/2⟩ from 100 strings of length L_M/2. They are selected randomly from the set of all possible L_M/2 substrings. In cases where L_M < 200, ⟨C_1/2⟩ is calculated from all possible substrings of length L_M/2.
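A sketch of this sampling scheme follows; the `complexity` argument stands for any callable returning the grammar complexity of a string (its implementation is not specified here), and σ_1/2 is taken as the sample standard deviation of the sampled values.

```python
import random
from statistics import mean, stdev

def fractional_uncertainty(message, complexity, n_samples=100, seed=0):
    """sigma_1/2 / <C_1/2> estimated from substrings of length L_M / 2:
    all possible substrings when L_M < 200, otherwise a random sample
    of n_samples starting positions."""
    rng = random.Random(seed)
    half = len(message) // 2
    starts = range(len(message) - half + 1)
    if len(message) >= 200:
        starts = rng.sample(list(starts), n_samples)
    values = [complexity(message[s:s + half]) for s in starts]
    return stdev(values) / mean(values)

# Delta C_ORIG is then C_ORIG multiplied by this fraction.
```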
A qualitative understanding of the complexity of a symbol sequence can be obtained by applying these measures to symbol sequences generated by standard systems that are commonly examined in dynamical systems theory. Five examples are considered here: a constant sequence (the same symbol is repeated), sequences generated by the Rössler and Lorenz systems (both three-dimensional ordinary differential equations), the Hénon system (a two-dimensional difference equation) and a random number generator. The technical specifications of the systems are given in Appendix Three. The Rössler, Lorenz, Hénon and random data are expressed as real variables. In order to apply a symbolic dynamics-based measure of complexity, it is necessary to project these data sets onto a discrete symbol set. There are several possible procedures for doing this. Radhakrishnan, et al. [165] used K-means clustering. While conceptually attractive, the results of K-means clustering can be very sensitive to initial conditions. Bradley and Fayyad [166] addressed this sensitivity by constructing a K-means algorithm that produces a refined initial condition and improves performance. Insofar as we know, this method has not been applied to the problem of converting real data to symbolic data. An alternative approach has been published by Hirata, et al. [167], who approximate a generating partition from a time series using tessellations. This is a computationally demanding procedure, and there are practical issues concerning the sensitivity of the partition to its initialization. Steuer, et al. [86] recommend using the partition that maximizes entropy. In the present examples, the continuous variable time series is partitioned about the median. In this process, the median is computed from the original time series. A real variable is replaced by symbol '0' if it is less than the median and by symbol '1' if it is greater than or equal to the median. The choice of the median rather than the mean is critical to this process. False-positive indications of deterministic structure in random data can result if the mean is used [168]. The partitioning process is depicted in Figure 9. (It should be noted that in the present paper, the consideration of partitioning protocol only applies to the didactic examples presented in the appendices. The psychotherapy data are symbolic and partitioning is not required.)
Figure 9 Partitioning real data onto a discrete symbol set. The median is determined from the original data. In this example, real variables are replaced by the symbol '0' if they are less than the median and by the symbol '1' if they are greater than or equal to the median.
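A minimal sketch of the median partition, with NumPy used only to compute the median:

```python
import numpy as np

def median_partition(x):
    """Map a real-valued series onto the symbols '0' and '1' about its
    median: '0' below the median, '1' at or above it (Figure 9)."""
    arr = np.asarray(x, dtype=float)
    med = np.median(arr)
    return ''.join('1' if v >= med else '0' for v in arr)

# hypothetical five-point series; the median is 0.9
print(median_partition([0.3, 1.7, 0.2, 2.5, 0.9]))   # -> '01011'
```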
The grammar complexity values computed from one thousand element symbol sequences generated by these model systems are shown in Figure 10. The results are consistent with our qualitative understanding of complexity. The constant sequence gives the lowest value, and the random number generator produces the largest value. The ordering Rössler less than Lorenz, less than Hénon is also consistent with expectations based on a visual examination of the time series in the left column of Figure .
Figure 10 Grammar complexity for one thousand element symbol sequences generated by the model systems. Symbol sequences were generated by the partitioning procedure outlined in Figure 9.
A critical distinction must be made between the complexity of a message, C_ORIG, and the intrinsic complexity of the process that generated the message. The value of grammar complexity depends on two factors: the complexity of the dynamical process generating the symbol sequence and the length of the symbol sequence. This is seen in the upper panel of Figure 11, where grammar complexity is plotted as a function of the length of the data set. The ordering of complexity values seen with 1000 element sequences in Figure 10 is preserved (random > Hénon > Lorenz > Rössler > constant), and the values increase with the size of the data set. It is therefore necessary to find an effective normalization of complexity values that allows comparison of intrinsic complexities without the complication of data set size.
Figure 11 Grammar complexity and normalized complexity as a function of data set size. Symbol sequences were generated from the previously defined model systems using a binary partition about the median. The upper panel shows the grammar complexity. The lower panel shows the normalized complexity.
It might be supposed that dividing C_ORIG by the length of the message is an acceptable solution. It has been shown that this is not the case [169]. An effective normalization of C_ORIG can be achieved by comparing it against the values of complexity obtained from random equiprobable messages of the same length. Let N_α be the size of the symbol alphabet (the number of distinct symbols available for message construction; in these examples N_α = 2). N_α should not be confused with the message length L_M. An equiprobable surrogate is one in which each symbol appears with probability 1/N_α. Let ⟨C_S⟩ denote the average value of complexity obtained from random equiprobable surrogates of length L_M (the subscript S denotes a surrogate). The normalized complexity is defined by

$$C_N = \frac{C_{ORIG}}{\langle C_S \rangle}$$

C_N ranges from close to zero for messages containing a single repeated symbol to close to one for messages generated by random processes.
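A sketch of the normalization, again assuming a `complexity` callable; drawing each position uniformly from the alphabet makes every symbol appear with probability 1/N_α, and the default surrogate count simply mirrors the N_SURR = 499 used in the hypothesis tests below.

```python
import random
from statistics import mean

def normalized_complexity(message, alphabet, complexity,
                          n_surr=499, seed=0):
    """C_N = C_ORIG / <C_S>, where <C_S> is the mean complexity of
    random equiprobable surrogates of the same length L_M."""
    rng = random.Random(seed)
    c_orig = complexity(message)
    c_s = mean(complexity(''.join(rng.choice(alphabet)
                                  for _ in message))
               for _ in range(n_surr))
    return c_orig / c_s
```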
As outlined in Rapp [170], an effective normalization cannot be formed by normalizing against random shuffle surrogates. Consider the case of a message that consists of a single repeated symbol selected from an alphabet of size N_α > 1. (Sequences in a message space of N_α = 1 consist of a single symbol and differ only in length. Trivially, their complexity is the number of bits required to encode the length L_M.) An effective normalization should give a low value of C_N to a repeated symbol message. Suppose surrogates were formed by a random shuffle. Since the message contains only one symbol, the surrogates all have the same value of complexity as the original. In this case, C_ORIG = ⟨C_S⟩ and hence C_N = 1, which is the normalized complexity of a random message. If instead the surrogates are equiprobable on an alphabet of N_α > 1 symbols, then ⟨C_S⟩ is greater than C_ORIG and C_N has a low value. A low value of complexity is expected for a constant sequence.
The uncertainty in C_N, ΔC_N, can be estimated by applying the standard propagation of uncertainty for a quotient to C_N = C_ORIG/⟨C_S⟩:

$$\frac{\Delta C_N}{C_N} = \left[ \left( \frac{\Delta C_{ORIG}}{C_{ORIG}} \right)^2 + \left( \frac{\Delta \langle C_S \rangle}{\langle C_S \rangle} \right)^2 \right]^{1/2}$$
The estimation of ΔC_ORIG has been discussed. ⟨C_S⟩ is the mean complexity computed from a distribution of equiprobable surrogates. Δ⟨C_S⟩ is the standard deviation of that distribution.
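In code, using the quotient formula above (itself a reconstruction of the propagation argument):

```python
from math import sqrt

def delta_c_n(c_orig, d_c_orig, c_s_mean, c_s_std):
    """Uncertainty in C_N propagated from Delta C_ORIG and the surrogate
    distribution's mean <C_S> and standard deviation Delta<C_S>."""
    c_n = c_orig / c_s_mean
    return c_n * sqrt((d_c_orig / c_orig) ** 2
                      + (c_s_std / c_s_mean) ** 2)
```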
Comparison of C_ORIG and the complexity values obtained with surrogates makes it possible to address the following surrogate null hypothesis:

As assessed by this complexity measure, the sequential structure of the original message is indistinguishable from the sequential structure of an equiprobable, random sequence of the same length constructed from the same symbol alphabet.
Several statistical tests of the null hypothesis have been considered [168]. We use here the Monte Carlo probability of the null hypothesis:

$$P_{NULL} = \frac{1 + \#\{\, C_{Surrogate} \leq C_{ORIG} \,\}}{1 + N_{SURR}}$$

N_SURR is the number of surrogates computed. The number of complexity values tested in the numerator includes the complexity of the original symbol sequence as well as the complexity values obtained with surrogates, ensuring that the numerator has a value of at least one. This statistical test was chosen because it is distribution-agnostic; that is, it makes no assumptions about the structure of the C_Surrogate distribution. Surrogates have a random structure, and the grammar complexity gives the highest value to random sequences. We therefore expect the values of C_Surrogate to be greater than the value of C_ORIG if a nonrandom structure is present in the original sequence. The smallest value of P_NULL will be obtained when all values of C_Surrogate are greater than C_ORIG. That minimum value is therefore 1/(1 + N_SURR). In the calculations in Figures 10 and 11, N_SURR = 499 and C_Surrogate > C_ORIG in all cases for the constant sequence, Rössler, Lorenz and Hénon data. The null hypothesis is therefore rejected with P_NULL = .002. As expected, the null hypothesis is not rejected by symbol sequences produced by a random number generator. For the ten cases in Figure 11 corresponding to L_M = 1000, 2000, ... 10,000, the average value of P_NULL obtained with random data is P_NULL = .562.
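A sketch of the test; because the original value is counted in the numerator, the result can never fall below 1/(1 + N_SURR):

```python
def p_null(c_orig, c_surrogates):
    """Monte Carlo probability of the surrogate null hypothesis: the
    fraction of the N_SURR + 1 complexity values (original included)
    that are less than or equal to C_ORIG."""
    numerator = 1 + sum(1 for c in c_surrogates if c <= c_orig)
    return numerator / (1 + len(c_surrogates))

# 499 surrogates, all more complex than the original: P_NULL = 1/500
print(p_null(27.0, [30.0] * 499))   # -> 0.002
```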
Barnard [171] and Hope [172] have proposed a nonparametric test for rejecting the null hypothesis. Under their criterion the null hypothesis is rejected if C_ORIG < C_Surrogate for all of the surrogates. If this criterion is met, as it is in these calculations, the probability of the null hypothesis is again P_NULL = 1/(1 + N_SURR).
As outlined in Watanabe, et al. [163], reporting values of complexity obtained with real variable data requires the specification of:
1. the complexity measure used,
2. the number of symbols in the alphabet,
3. the procedure used to partition values onto the symbol set,
4. the procedure used to generate the surrogates used to calculate C_N,
5. the number of surrogates used, and
6. the statistical procedure used to calculate the probability of the surrogate null hypothesis.