|Home | About | Journals | Submit | Contact Us | Français|
Reservoir computing (RC) is a recent paradigm in the field of recurrent neural networks. Networks in RC have a sparsely and randomly connected fixed hidden layer, and only output connections are trained. RC networks have recently received increased attention as a mathematical model for generic neural microcircuits to investigate and explain computations in neocortical columns. Applied to specific tasks, their fixed random connectivity, however, leads to significant variation in performance. Few problem-specific optimization procedures are known, which would be important for engineering applications, but also in order to understand how networks in biology are shaped to be optimally adapted to requirements of their environment. We study a general network initialization method using permutation matrices and derive a new unsupervised learning rule based on intrinsic plasticity (IP). The IP-based learning uses only local learning, and its aim is to improve network performance in a self-organized way. Using three different benchmarks, we show that networks with permutation matrices for the reservoir connectivity have much more persistent memory than the other methods but are also able to perform highly nonlinear mappings. We also show that IP-based on sigmoid transfer functions is limited concerning the output distributions that can be achieved.
Recurrent loops are abundant in the neural circuits of the mammalian cortex. Massive reciprocal connections exist on different scales, linking different brain areas, as well as connecting individual neurons in cortical columns. In these columns as many as 80% of the synapses of neocortical interneurons form a dense local network (Maass and Markram, 2006) using very specific connectivity patterns for different neuron types (Douglas et al., 2004). These recurrent microcircuits are very stereotypical and repeated over the entire neocortex (Silberberg et al., 2002).
Two challenges for computational models of the neocortex are (a) explaining how these stereotypical microcircuits enable an animal to process a continuous stream of rapidly changing information from its environment (Maass et al., 2002), and (b) how these circuits contribute to the prediction of future events, one of the critical requirements for higher cognitive function (Maass and Markram, 2006).
To address these challenges, a mathematical model for generic neural microcircuits, namely, the liquid state machine (LSM), was proposed by Maass et al. (2002). The framework for this model is based on real-time computation without stable attractors. The neural microcircuits are considered as dynamical systems, and the time-varying input is seen as a perturbation to the state of the high-dimensional excitable medium implemented by the microcircuit. The neurons act as a series of nonlinear filters, which transform the input stream into a high-dimensional space. These transient internal states are then transformed into stable target outputs by readout neurons, which are easy to train (e.g., in order to do prediction or classification of input signals) and avoid many of the problems of more traditional methods of recurrent neural network training such as slow convergence and vanishing gradients [first described in Hochreiter (1991), see also Bengio et al. (1994) andHochreiter et al. (2001)]. This approach to neural modeling has become known as reservoir computing (see Verstraeten et al. (2007) and Lukosevicius and Jaeger (2007) for reviews), and the LSM is one particular kind of model following this paradigm (Table (Table11).
Echo state networks (ESN) (Jaeger, 2001a; Jaeger and Haas, 2004) are another reservoir computing model similar to LSM. They implement the same concept of keeping a fixed high-dimensional reservoir of neurons, usually with random connection weights between reservoir neurons small enough to guarantee stability. Learning procedures train only the output weights of the network to generate target outputs, but while LSM use spiking neuron models, ESN are usually implemented with sigmoidal nodes, which are updated in discrete time-steps. We use ESN in all the experiments of this study for their better analytical tractability. An illustration of the architecture of ESN is given in Fig. Fig.1.1. Please refer to the Materials and Methods section for a more detailed introduction to the mathematical model.
One problem of the reservoir computing approach is that there is considerable variation when different random reservoir initializations are used with all other network parameters remaining fixed (Ozturk et al., 2007). There has been early work on unsupervised optimization of recurrent neural networks by Hochreiter and Schmidhuber (1997a) and Klapper-Rybicka et al. (2001), and several reservoir specific optimization methods have been presented [e.g., Rad et al. (2008); Schmidhuber et al. (2007, 2005); Schrauwen et al. (2008); Steil (2007); Wierstra et al. (2005)]; however, as of yet there is no universally accepted standard method. We study the effect of approaches aiming to reduce this dependency on a random initialization. In order to do this, we investigate two different approaches: the first one is based on a very specific initialization of the reservoirs of ESNs (Hajnal and Lörincz, 2006; White et al., 2004), and the second one implements a pretraining phase for the reservoirs using a local adaptation mechanism. This mechanism models changes in a neuron’s intrinsic excitability in response to stimulation. Specifically, it models the phenomenon that biological neurons lower their firing threshold over time when they receive a lot of stimulation, and raise it when stimulation is lacking (see, e.g., Zhang and Linden (2003) and Daoudal and Debanne (2003) for more details). A learning rule based on this phenomenon was introduced by Triesch (2005), further studied in (Triesch, 2007), and was termed intrinsic plasticity (IP). The reservoir initialization method is based on the idea of optimally exploiting the high dimensionality of the reservoir. The methods based on IP, while also using high-dimensional reservoirs, aim to adapt the reservoir for a high entropy of codes. Moreover, we investigate an IP-based learning rule for high sparsity of codes as these have been shown to improve information processing (Field, 1994).
We evaluate the different reservoir shaping and initialization methods using three different standard benchmarks. We find that reservoirs initialized with orthogonal column vectors in their connectivity matrix exhibit superior short-term memory capacity, and are also able to perform well in tasks requiring highly nonlinear mappings. Furthermore, we identify a problem with an existing IP-based rule and point out limitations of the approach if traditional neuron models are used. We discuss our contributions in the wider context of recurrent neural network research, and review possible solutions to the limitations that are identified as consequences of our results in this study.
We first present an IP-based learning rule that changes a neurons output distribution to approximate a Laplace distribution. Consequently, we present results for the evaluation of ESN with random reservoirs, reservoirs using permutation matrices, and reservoirs pretrained with IP for Gaussian and Laplace distributions, respectively.
IP learning was introduced in (Triesch, 2005) as a way to improve information transmission in neurons while adhering to homeostatic constraints like limited energy usage. For a fixed energy expenditure (represented by a fixed mean of the neurons output distribution), the distribution that maximizes the entropy (and therefore the information transmission) is the exponential distribution. This was used for single neurons in Triesch (2005) and for neurons in a reservoir in Steil (2007) where it led to a performance improvement over standard random reservoirs. In Schrauwen et al., (2008), IP learning for a Gaussian output distribution of reservoir neurons was investigated, which is the maximum entropy distribution if, in addition to the mean, the variance of the output distribution is fixed. Again, an overall increase in performance was noted for several benchmark problems.
A Laplace distribution would lead to sparser codes than the Gaussian, and our hypothesis is that enough entropy would be preserved for a good input signal approximation. Researching Laplace output distributions was also suggested in Schrauwen et al. (2008) for similar reasons. Here, analogous to the calculations in Schrauwen et al. (2008) andTriesch (2005), we derive an IP learning rule for this distribution to test our hypothesis.
In order to model the changes in intrinsic excitability of the neurons, the transfer function of our neurons is generalized with a gain parameter a and a bias parameter b,
The Laplace distribution, which we desire as the reservoir neurons’ output distribution, is defined as
Let denote the sampled output distribution of a reservoir neuron and let the desired output distribution be p(y), thus, p(y)=f(yμ,c). In the learning process, we try to minimize the difference between and p(y), which can be measured with the Kullback–Leibler divergence DKL. Thus, we try to minimize
Taking derivatives of this function with respect to the gain and bias parameters of the neurons’ transfer function (see the Materials and Methods section for details), we arrive at the following learning rules for stochastic gradient descent with learning rate η:
In order to compare the different reservoir shaping and initialization methods, we tested them on three different standard benchmarks (cf. Materials and Methods section for details). The first set of experiments evaluated the short-term memory capacity (MC) of the different networks. In addition, we evaluated the networks on the task of modeling a 30th order nonlinear auto regressive moving average (NARMA) system, and with respect to their one-step prediction performance on the Mackey–Glass time-series. These tasks cover a reasonably wide spectrum of tests for different useful properties of reservoirs and are widely used in the literature, e.g., in Jaeger (2001a, 2001b), Rad et al. (2008), Schrauwen et al. (2008), and Steil (2007).
We tested ESN with four different conditions for the connectivity matrix of the reservoir. In condition RND, the reservoir matrix was initialized with uniform random values between [−1;1]. Condition PMT tested a permutation matrix for the reservoir connectivity. Finally, we used IP optimization with a Gaussian distribution [cf. Schrauwen et al. (2008)] in IPGAUSS and a Laplace distribution (as described in the Materials and Methods section) in IPLAP. In all conditions, the reservoirs were scaled to have a spectral radius of 0.95. In the case of the IP methods, this scaling was done once before the IP training was started.
The results of the experiments are given in Table Table2,2, averaged over 50 simulation runs for each of the four conditions. The networks in the PMT condition essentially show double the memory capacity of networks in the other conditions, while networks pretrained with IPGAUSS and IPLAP have very similar values and show a slight increase compared to condition RND. Figure Figure22 shows plots of the individual MCk curves (see Materials and Methods for their definition) for all conditions in the MC task, with the curve for PMT showing much longer correlations than all the others. The results for the NARMA modeling task are less pronounced, but look similar in that the PMT networks perform better than the other tested conditions. The normalized root mean squared error (NRMSE) for IPLAP and IPGAUSS is very similar again; however, only IPGAUSS has a slight advantage over RND. For the Mackey–Glass one-step prediction, the performance of the IPLAP networks is better than the other ones, slightly ahead of RND (difference well within the standard deviation of the error, however). The PMT networks perform worst on this task.
The superior performance of networks in PMT on the short-term memory task could be expected: networks with a connectivity based on permutation matrices form are a particular instance of orthogonal networks, which, in turn, can be seen as a multidimensional version of a delay line (for details, see the Materials and Methods section). From a system theory perspective, the eigenvalues (poles) of the linearized system implemented by the network correspond to bandpass filters with center frequencies according to their angle in the complex plane (Ozturk et al., 2007). Larger eigenvalues will lead to longer time constants for the filters, preserving information for longer time in the network. Figure Figure3a3a shows that the eigenvalues of the connectivity matrix of a 100 node ESN in the PMT condition are all of the same magnitude, and are spaced relatively uniformly just below the unit circle (the reservoir matrix was scaled to have a maximum absolute eigenvalue of 0.95 in this case, i.e., the matrix elements were either 0 or 0.95). The filters implemented by this network will thus have long time constants and provide support for many different frequencies in order to reconstruct the input signal. Compare this to the distribution of the eigenvalues of a connectivity matrix of an equally sized RND network in Fig. Fig.3b:3b: they are much less uniformly distributed and have very different magnitudes, resulting in a mixture of both longer and shorter time constants for the network.
The PMT networks also outperform the other methods on the highly nonlinear NARMA task, which is less obvious. The NARMA task needs long memory, which the orthogonal reservoirs in PMT are able to provide; but one might suspect that the specific rather sparse connectivity would not be able to perform the kind of nonlinear mappings that the task requires (since there is less interaction between neurons in the reservoir than in networks which are more densely connected). The results show that this is not the case.
The Mackey–Glass prediction task requires shorter time constants and less memory than the other two tasks. In this case, the IPLAP networks perform best, slightly ahead of the RND condition. The PMT networks have the same spectral radius as the ones in RND; however, all eigenvalues in PMT have the same (large) magnitude. Therefore, the network is missing elements implementing shorter time constants, which would let it react to fast changes in the input. The best results we received on this task were actually achieved using ESN with fermi neurons and IP learning with an exponential output distribution (results not shown). In this case, the results were significantly better than the ones in RND [cf. also Steil (2007)].
A closer investigation of the almost identical performance of both IP methods revealed that IPLAP also generated normally distributed output, very similar to IPGAUSS. To better understand the effect of the different IP rules, we used IP to approximate the Laplace, the Gaussian (both with a tanh activation function), and the exponential distribution (fermi activation function), respectively, with a single feedforward unit and uniformly distributed input on the interval [−1;1]. As expected, the IP learning rule can successfully generate exponentially distributed output values [Fig. [Fig.5a].5a]. IP fails, however, to generate output distributions that resemble the Gaussian or the Laplace [Figs. [Figs.5b,5b, ,5c].5c]. This seems surprising, in particular, for the Gaussian, as IP has successfully been used to shape the output distribution of a reservoir (Schrauwen et al., 2008). A possible explanation for this phenomenon is discussed in the next section.
Recurrent neural network models are attractive because they hold the promise of explaining information processing in biological systems, such as the human neocortex, as well as being able to serve as tools for engineering applications. Moreover, they are computationally more powerful than simple feedforward architectures, since they can exploit long-term dependencies in the input when calculating the output. Their theoretical analysis is complex, yet they are increasingly popular—mainly due to new efficient architectures and training methods published in recent years [see Hammer et al. (2009) for an excellent recent overview]. These include, for instance, self-organizing maps for time-series (Hammer et al., 2005), long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997b), Evolino (Schmidhuber et al., 2007, 2005), and also the reservoir computing approach as mentioned earlier in the article.
One of the obstacles to the initial applicability of recurrent neural networks was the problem of slow convergence and instabilities of training algorithms such as backpropagation-through-time (BPTT) (Werbos, 1990) or real-time recurrent learning (RTRL) (Williams and Zipser, 1989). One of the important insights of the reservoir computing approach to overcome these limitations was that it often suffices to choose a fixed recurrent layer connectivity at random and only train the output connections. Compared to BPTT or RTRL, this simplifies the training process to a linear regression problem, resulting in significant speed improvements. It also enabled the use of much larger networks than possible with BPTT or RTRL. Results on standard benchmarks using reservoir methods were better than any previous of these methods and practical applications were also shown (Jaeger, 2003).
Compared to “traditional” recurrent neural network learning methods, such as BPTT or RTRL, the reservoir computing paradigm represents a significant step forward for recurrent neural network technologies. On the other hand, it is clear that the approach gives up many of the degrees of freedom the networks would normally have by fixing the recurrent layer connectivity. Advanced methods such as LSTM networks (Hochreiter and Schmidhuber, 1997b) share the advantage of fast learning with ESN, but without restricting the networks to a fixed connectivity, using a more involved architecture and training method. For ESN, it has been shown that fixing the connectivity has the effect that different random initializations of a reservoir will lead to rather large variations in performance if all other parameters of the network setup remain the same (Ozturk et al., 2007). There have been proposals on how to manually design ESN in order to give performance that will consistently be better than random initializations (Ozturk et al., 2007), but there are no universally accepted standard training algorithms to adapt the connectivity in a problem-specific and automatic way, before the output connections are trained.
This article makes two contributions with regard to this problem. As a first contribution, we investigate permutation matrices for the reservoir connectivity. Their benefit lies in the simple and inexpensive way in which they can be constructed, and we show that they implement a very effective connectivity for problems involving a long input history, as well as nonlinear mappings. For problems requiring fast responses of the network (or a mixture of slow and fast responses), their usefulness is limited. Furthermore, the method is general and not problem-specific. The IP-based approaches we present and reference throughout the article represent a problem-specific training method. They make use of the input signal in order to shape the output of the reservoir neurons according to a desired probability distribution. The second contribution in the article is a variation in the IP-based approach by deriving a new learning rule to shape the reservoir node outputs according to a Laplace distribution, and to point out limitations of this method when standard sigmoidal neuron types are used in the network.
Concerning this last point, the illustration in Fig. Fig.44 sheds light on the reason why an approximation of some distributions with IP is more difficult than others: given a uniform input distribution and a sigmoid transfer function, IP learning selects a slice from an output distribution that peaks toward either end of the input range, but never in the center. The output of an IP trained self-recurrent unit gives insight into why it is possible to achieve a Gaussian output distribution in a reservoir [Fig. [Fig.5d].5d]. From the central limit theorem it follows that the sum of several independent and identically distributed (i.i.d.) random variables approximates a Gaussian. Even though in case of a recurrent reservoir not all inputs to a unit will be i.i.d., IP has to make input distributions only similar to each other to approximate a normal distribution in the output. This also explains why the output of an IP Laplace trained reservoir is normally distributed (several inputs with equal or at least very similar distributions are summed up). For uniform input and a single unit without recurrence, the best IP can do is to choose the linear part of the activation function, so that the output is also uniformly distributed (a slice more in the middle of Fig. Fig.4).4). With self-recurrent connections, this leads to initially uniform distributions, which sum up. The resulting output and eventually the whole reservoir output distribution become more and more Gaussian. A consequence of this effect is that IP with sigmoid transfer functions cannot be generalized to arbitrary distributions.
Our results confirm two important points that have been suggested for versatile networks, i.e., networks, which should perform well even when faced with several different input signals or which might be used for tasks with different requirements. Ozturk et al. (2007) proposed that the eigenvalue distribution of reservoir matrices should be as uniform as possible, and that it would be needed to scale the effective spectral radius of the network up or down. For this scaling, they suggested an adaptable bias to the inputs of each reservoir node. With regard to this proposed requirement, we observed another limitation of purely IP-based reservoir pretraining: in our experiments [also reported in Steil (2007)], the IP learning rule always increased the spectral radius of the reservoir matrix (dependent on the setting of the IP parameters, cf. Schrauwen et al. (2008), and never decreased it (this is only true for reservoirs which are initialized with spectral radius <1). This leads to longer memory, making it harder for the network to react quickly to new input and causing interference of slowly fading older inputs with more recent ones. To alleviate this problem, a combination of intrinsic and synaptic plasticity, as studied by Lazar et al. (2007), could enable scaling of the reservoir matrix depending on the current requirements of the task. The efficiency of synergy effects that result from a combination of intrinsic and synaptic plasticity was also shown by Triesch (2007), who additionally suggested that this combination might be useful in driving networks to the region of best computational performance [edge of chaos, cf. Bertschinger and Natschläger (2004)] while keeping a fixed energy expenditure for the neurons. In Lazar et al. (2007), this hypothesis was confirmed to the extent that networks with a combination of intrinsic and extrinsic plasticity were generally closer to this region than networks trained with only one form of plasticity.
Learning algorithms for recurrent neural networks have been widely studied (e.g., see Doya (1995) for an overview). They have, however, not been widely used (Jaeger, 2003) because these learning algorithms, when applied to arbitrary network architectures, suffer from problems of slow convergence. ESN provide a specific architecture and a training procedure that aims to solve the problem of slow convergence (Jaeger, 2001a; Jaeger and Haas, 2004). ESN are normally used with a discrete-time model, i.e., the network dynamics are defined for discrete time-steps t, and they consist of inputs, a recurrently connected hidden layer (also called reservoir) and an output layer (see Fig. Fig.11).
We denote the activations of units in the individual layers at time t by ut, xt, and ot for the inputs, the hidden layer, and the output layer, respectively. We use win, W, and wout as matrices of the respective synaptic connection weights. Using f(x)=tanh x as output nonlinearity for all hidden layer units, the network dynamics is defined as
The main differences of ESN to traditional recurrent network approaches are the setup of the connection weights and the training procedure. To construct an ESN, units in the input layer and the hidden layer are connected randomly. Connections between the hidden layer and the output units are the only connections that are trained, usually with a supervised offline learning approach: Training data are used to drive the network, and at each time step t, activations of all hidden units x(t) are saved as a new column to a state matrix. At the same time, the desired activations of output units oteach(t) are collected in a second matrix. Training in this approach then means to determine the weight’s wout so that the error ϵtrain(t)=(oteach(t)−o(t))2 is minimized. This can be achieved using a simple linear regression [see Jaeger (2001a) for details on the learning procedure].
For the approach to work successfully, however, connections in the reservoir cannot be completely random; ESN reservoirs are typically designed to have the echo state property. The definition of the echo state property has been outlined in Jaeger (2001a). The following section describes this property in a slightly more compact form.
The echo state property: Consider a time-discrete recursive function xt+1=F(xt,ut) that is defined at least on a compact subarea of the vector-space xRn, with n being the number of internal units. The xt are to be interpreted as internal states and ut is some external input sequence, i.e., the stimulus.
The definition of the echo state property is as follows: Assume an infinite stimulus sequence: and two random initial internal states of the system x0 and y0. To both initial states x0 and y0 the sequences and can be assigned. The update equations for xt+1 and yt+1 are then
The system F() will have the echo states if it is independent from the set ut, and if for any (x0,y0) and all real values ϵ>0, there exists a δ(ϵ) for which
for all tδ(ϵ), where d is a square Euclidean metric.
For all of the experiments, we used ESNs with 1 input and 100 reservoir nodes. The number of output nodes was 1 for the NARMA and Mackey–Glass tasks, and 200 for the MC evaluation. In the latter, the 200 output nodes were trained on the input signal delayed by k steps (k=1…200). The input weights were always initialized with values from a uniform random distribution in the range [−0.1;0.1].
The output weights were computed offline using the pseudoinverse of a matrix X composed of the reservoir node activations over the last 1000 of a total of 2000 steps as columns, and the input signal. In the case of the MC task, the delayed input was used as follows: wout,k=(u1001−k…2000−kX†)T, with X† denoting the pseudoinverse and k=1…200.
For all three benchmark tasks, the parameter μ of the IP learning rule (both for IPGAUSS and IPLAP) was set to 0. The other IP related parameters were set according to Table Table3.3. For both IP methods the reservoir was pretrained for 100 000 steps in order to ensure convergence to the desired probability distribution, with a learning rate of 0.0005. In all conditions, the spectral radius of the reservoir connectivity matrix was scaled to 0.95 (prior to pretraining in case of IP).
Different input time-series were used for training the output weights and for testing in all cases. The input length for testing was always 2000 steps. The first 1000 steps of the reservoir node activations were discarded to get rid of transient states due to initialization with zeros before calculating the output weights and the test error.
To evaluate the short-term memory capacity of the different networks, we computed the k-delay memory capacity (MCk) defined in Jaeger (2001b) as
This is essentially a squared correlation coefficient between the desired signal delayed by k steps and the reconstruction by the kth output node of the network. The actual short-term memory capacity of the network is defined as , but since we can only use a finite number of output nodes, we limited their number to 200, which is sufficient to see a significant drop-off in performance for the networks in all of the tested conditions. The input for the MC task was random values sampled from a uniform random distribution in the range [−0.8;0.8].
The evaluation for the NARMA modeling and the Mackey–Glass prediction tasks was done using the normalized root mean squared error measure, defined as
where is the sampled output and y(t) is the desired output. For the NARMA task, the input time-series x(t) was sampled from a uniform random distribution between [0,0.5]. The desired output at time t+1 was calculated as
The Mackey–Glass time-series for the last experiment was computed by integrating the system
from time step t to t+1. The τ parameter was set to 17 in order to yield a mildly chaotic behavior.
Orthogonal networks (White et al., 2004) have an orthogonal reservoir matrix W (i.e., WWT=1) and linear activation functions. These networks are inspired by a distributed version of a delay line, where input values are embedded in distinct orthogonal directions, leading to high memory capacity (White et al., 2004). Permutation matrices, as used by Hajnal and Lörincz (2006), consist of randomly permuted diagonal matrices and are a special case of orthogonal networks. Here, and in Hajnal and Lörincz (2006), the hyperbolic tangent (tanh) activation function was used, in order to facilitate nonlinear tasks beyond memorization.
Using the initial form presented in the Results section, we get
where we have made use of the relation , where is the sampled distribution of the input. Writing this as and substituting it for in the first term of the above equation. In order to minimize the function DKL, we first derive it with respect to the bias parameter b,
The first term in the above equation is
so we have
The derivation with respect to the gain parameter a is analogous and yields
From these derivatives, we identify the following learning rules for stochastic gradient descent with learning rate η,
This work was partially supported by a JSPS Fellowship for Young Researchers, by the JST ERATO Synergistic Intelligence Project, and by Taiwanese National Science Council grant number 98-2218-E-194-003-MY2. We thank Richard C. Keely for proofreading the article, and the Reservoir Laboratory at University of Ghent, Belgium for providing the Reservoir Computing (RC) toolbox for MATLAB, as well as Herbert Jaeger for making source code available. Furthermore, we thank the anonymous reviewers for constructive comments on the initial drafts of the article.