|Home | About | Journals | Submit | Contact Us | Français|
When a neuronal spike train is observed, what can we say about the properties of the neuron that generated it? A natural way to answer this question is to make an assumption about the type of neuron, select an appropriate model for this type, and then to choose the model parameters as those that are most likely to generate the observed spike train. This is the maximum likelihood method. If the neuron obeys simple integrate and fire dynamics, Paninski, Pillow, and Simoncelli (2004) showed that its negative log-likelihood function is convex and that its unique global minimum can thus be found by gradient descent techniques. The global minimum property requires independence of spike time intervals. Lack of history dependence is, however, an important constraint that is not fulfilled in many biological neurons which are known to generate a rich repertoire of spiking behaviors that are incompatible with history independence. Therefore, we expanded the integrate and fire model by including one additional variable, a variable threshold (Mihalas & Niebur, 2009) allowing for history-dependent firing patterns. This neuronal model produces a large number of spiking behaviors while still being linear. Linearity is important as it maintains the distribution of the random variables and still allows for maximum likelihood methods to be used. In this study we show that, although convexity of the negative log-likelihood is not guaranteed for this model, the minimum of the negative log-likelihood function yields a good estimate for the model parameters, in particular if the noise level is treated as a free parameter. Furthermore, we show that a nonlinear function minimization method (r-algorithm with space dilation) frequently reaches the global minimum.
Integrate-and-fire (IF) model neurons ignore the rapid spike generation process, instead emphasizing sub-threshold integration dynamics (Burkitt, 2006). IF neurons are thus more realistic than the simplest neural models, like McCulloch-Pitts neurons or coincidence detectors (McCulloch & Pitts, 1943; Mikula & Niebur, 2003, 2004, 2008), which makes them a computationally efficient model that can predict spiking times recorded in some physiological experiments reasonably well (Kim, Sripati, Vogelstein, Armiger, Russell, & Bensmaia, 2009). Extensions of the classical integrate-and-fire model, introduced more than a century ago (Lapicque, 1907), expand the applicability of the model to more biophysically realistic firing patterns (Izhikevich, 2003; Brette & Gerstner, 2005; Tonnelier, Belmabrouk, & Martinez, 2007; Mihalas & Niebur, 2009).
In the traditional IF neuron, the membrane voltage is reset to a predetermined value after each spike, and each interspike interval is independent of the previous. This lack of history dependence limits the complexity of the spiking behavior the model can generate. For instance, adaptation and bursting behavior are found in many biological neurons but cannot be explained by the classical IF neuron. Mihalas and Niebur (2009) extended the IF neuron model by adding a variable threshold and an arbitrary number of spike-induced currents (Hille, 1992), all with linear dynamics. This generalized integrate-and-fire neuron is capable of producing spiking/bursting, tonic, phasic or adapting responses, depolarizing and/or hyperpolarizing afterpotentials etc.. The equations governing the time evolution of the membrane potential are solved analytically between spiking, allowing (but not requiring) event-based implementations of the model. Furthermore, the analytical solution is obtained very efficiently since the model dynamics can be written as a diagonalizable set of linear differential equations. In this study, we focus on this generalized leaky IF (LIF) model, because it is the only one of the cited extensions of the LIF models that maintains linearity of the state variable dynamics.
An important problem in computational neuroscience is the determination of the parameters of a neuronal model given the available experimental data (Prinz, 2007). We assume that only spike trains (i.e., the sequence of time points when a neuron spiked) are available but not voltage traces; this is the case in extracellular recordings. Parameter optimization is then accomplished by minimizing a suitably chosen cost function. Choosing a cost function is not easy; for instance, using the benchmark (the “Г− factor”) for spike-train prediction proposed by the INCF Quantitative Single-Neuron Modeling 2009 competition as the cost function (Jolivet, Kobayashi, Rauch, Naud, Shinomoto, & Gerstner, 2008) results in large numbers of local minima. Finding the global minimum then requires substantial computational resources (Rossant, Goodman, Platkiewicz, & Brette, 2010).
A natural cost function for parameter fitting is the maximum likelihood estimator of the observed sequence of spike times. Paninski et al. (2004) showed that the negative log-likelihood function of the stochastic LIF neuron model is convex and that its unique global minimum can thus be reached using gradient descent techniques.
As mentioned, the complexity of biological spike trains exceeds what can be generated by the stochastic LIF neuron but the generalized LIF model can generate a much richer set of behaviors (Mihalas & Niebur, 2009). Assuming that noise appears only in the threshold, we have shown that the likelihood function of the generalized LIF model can be computed by solving the one-dimensional Fokker-Planck equation (Russell et al, in press). In the present study, we first generalize that result by deriving the Fokker-Planck equation of the likelihood function with noise in both membrane voltage and threshold. For computational efficiency, we then derive a reduced, one-dimensional Fokker-Planck equation for the likelihood function, in this case assuming noise only in the membrane voltage. We then show that numerical accuracy is substantially improved by computing the likelihood function of the generalized LIF model by the Volterra integral equation method (Paninski, Haith, & Szirtes, 2008). Convexity of the negative log-likelihood function is not guaranteed after inclusion of the variable threshold but we show that in practice, common optimization methods converge for this generalized LIF neuron.
The membrane voltage V in the stochastic LIF model evolves according to the dynamics
where Isti(t) is the input current and Iind(t) is the sum of all spike-induced currents. The Gaussian white noise, ε(t), is scaled by the constant σ, and C, g, and Vleak are constants representing the capacitance, membrane conductance, and equilibrium membrane voltage, respectively. A spike is generated and the voltage is reset when V (t) exceeds the threshold Θ, a constant. Paninski et al. (2004) showed that the negative log-likelihood function of generating a given (observed) spike train by this model is convex, and that its global minimum can therefore be found using gradient descent techniques.
Mihalas and Niebur (2009) generalized the LIF model by introducing a variable threshold (Θ(t)) which depends on the membrane voltage,
where a, b are threshold adaptation and threshold rebound constants, respectively, and Θ∞ is the threshold at equilibrium. The dynamics of the threshold introduces history dependence, i.e. correlations between interspike intervals. The model allows for an arbitrary number N of spike induced currents, given by the following equations,
When V reaches threshold, V (t) = Θ(t), three types of events occur. First, a spike is generated. Second, voltage and threshold are set according to the following boundary conditions,
ensuring that after the update, the membrane potential V (t) is below the instantaneous threshold Θ(t). Third, the spike induced currents Ii(t) are updated by the following rules:
where Ai is the magnitude of the added current. This model is capable of generating a much larger variety of neuronal behaviors than the traditional LIF model (Mihalas & Niebur, 2009).
Assume that a neuron receives a known stimulus (input) current and in response generates a spike train. Assume further that we decide to model it as a generalized LIF neuron. How do we choose the parameters of the model? A natural estimator of the parameters is given by choosing them such that the observed spike train is the most likely for the given input current. We thus need to maximize the likelihood of observing the spike train or, equivalently (but technically more conveniently), minimize its negative log-likelihood. In this Section, we compute the likelihood for the whole spike train as a function of the individual inter-spike intervals (note that this computation would be trivial if the inter-spike intervals would be independent which is not the case in the general case we are concerned with here). In the following Section (2.3), we will compute the likelihood for individual inter-spike intervals.
The stimulus current starts at time t0 and continues throughout the trial. The spike train is composed of a total of n spike time intervals, with the ith spike time interval starting at time ti−1 and ending at time ti. Define the conditional probability that the neuron fires at time t, which is in the ith spike time interval, given the previous spike times intervals, as
Equivalently, the function fi is the probability density function of the first passage time in the ith spike time interval. We define Ei for i > 1 as the event that the neuron fires at times ti−1 and ti (and not in-between; also note that event E1 consists of the stimulus onset at time t0 and the first spike at t1). The likelihood of the observed spike train is that all events E1, . . . ,En happen, which is P(E1∩E2∩∩En). From the multiplication rule in probability theory we know that,
Here, Sk(t) := V (t) − Θ(t) is the voltage above threshold and the variable t in this equation is understood to be in the kth spike time interval, t (tk−1, tk). The likelihood that the neuron fires at time ti given its history is obtained by evaluating the conditional probability in eq. 6 at time ti, yielding
The likelihood of the train of n spikes is thus,
Having obtained the likelihood for the spike train as a function of the likelihoods of the individual inter-spike intervals, we now have to compute the latter, i.e. the functions fi(ti|tk, k = 1, . . . , i − 1). This can be achieved by two methods1 proposed by Paninski et al. (2004, 2008). The first, discussed in Section 2.3.1, is by solving the Fokker-Planck equation for the probability density function of the state variables (voltage and threshold; the time dependence of the threshold requires a slight generalization of Paninski et al. (2004)’s approach). The second method, discussed in Section 2.3.3, is by solving a Volterra integral equation directly for the likelihood of the inter-spike interval. We also introduce, in Section 2.3.2, a one-dimensional approximation for the Fokker-Planck equation which is more efficient to compute than the two-dimensional version.
For simplicity, we introduce the change of variables V (t) → CV (t), Θ(t) → CΘ(t) and g → g/C. Defining as before S = V − Θ (voltage above threshold) and
Now define P(S,Θ, t) as the probability density function for the variables S and Θ at time t. The evolution of P(S,Θ, t) is governed by the two-dimensional Fokker-Planck equation,
There are two boundary conditions. One is absorbing,
since the membrane voltage can cross the threshold only once. The other is the resetting condition at spike time ti. We show in Appendix A that this condition is
where δ() is a delta function, is the time immediately before the reset happens, and R is a linear rectifier
Finally, Z is defined in Appendix A, it ensures that the probability is normalized to unity. Given P(S,Θ, t) for the ith spike time interval, the likelihood is the computed from P as
In practice, the 2-D Fokker-Planck equation method works well for calculating the likelihood function but it is inefficient and therefore of limited use. In the following section we show how the dimension in which the Fokker-Planck equation needs to be solved is reduced which results in a more efficient solution; it is the latter which will be used in Section 3.2.
The numerical computation of the partial differential equation eq. 12 is time consuming since it requires calculation of the probability density in the two-dimensional (S, Θ) space. One way to reduce the complexity of the problem is to use a deterministic threshold Θd(t) which is computed from the deterministic version of Equation 11,
The voltage above threshold variable S, however, remains a stochastic process, governed by,
where σ1 is the noise level for the 1-dimensional problem. Note that the deterministic threshold Θd(t) obtained in eq. 15 is used in eq. 16. The probability density then depends only on one variable, S, and the Fokker-Planck equation is one-dimensional,
There is only one boundary condition, and it is again absorbing,
As shown in Appendix B, the stochastic term of S(t) in the 2-D case is,
and in the 1-D case it is,
Note that the difference between the 1-D and 2-D cases for the stochastic process S(t) is only the noise model. Both noise terms are Gaussian processes but the structure is more complex in the 2-D case because of the additional exponential function. We show again in Appendix B that when t is long enough, its variance saturates at
Finally, we show in Appendix B that when t is long enough, the variance of the 1-d noise term saturates at
and that the noise levels in the 1-D and 2-D conditions become comparable if the following condition is met:
Given P(S, t) for the ith spike time interval, the likelihood can be evaluated as
We show in the Results section that for the optimization of deterministic neuronal models, details of the noise model do not matter. Noise is treated as an auxiliary parameter in the optimization by starting with a large value and then decreasing it. The optimized parameters obtained by the 1-D Fokker-Planck equation method and the 2-D Fokker-Planck equation method converge in the case of infinitesimal small noise. For reasons of efficiency, only the 1-D Fokker-Planck equation is solved in Section 3.2.
Since the time derivatives of the state variables S,Θ depend linearly on the variables in eqs. 11, the Mihalas and Niebur (2009) model is an Ornstein-Uhlenbeck process. Schrödinger (1915) showed that the likelihood for the first-passage time can be obtained by solving a Volterra integral equation of the second kind.
Using the same notation as in Section 2.2, the likelihood for the ith spike time interval is the first-passage-time probability density (FPTPD) function evaluated at the neuron spike time ti (Buonocore, Nobile, & Ricciardi, 1987; Paninski et al., 2008). To improve computational efficiency (see Section 2.3.2), we assume that noise is injected only in the voltage, not the threshold (this approach is complementary to the one we took in Russell et al, in press), i.e., in the following we replace the stochastic threshold by the deterministic threshold Θd(t).
The equation governing S(t) is the same as Equation 16,
with the definition . Taking the expectation value of the square of formula 20, we obtain the variance of the S(t|x, ti−1) process as3
Since S(t) is an Ornstein-Uhlenbeck process, it is Gaussian. It’s transient probability G(y, t | x, ti−1) is the probability that S(t) = y given S(ti−1) = x:
Plesser and Tanaka (1997) showed that the FPTPD function p(t) for the stochastic process S(t) can be computed by a Volterra integral equation of the first kind, and Buonocore et al. (1987) transformed this into a Volterra integral equation of the second kind (an equivalent equation for constant input was derived and solved by Schrödinger (1915)):
In our previous study (Dong et al, in press), we showed that calculating the FPTPD function by discretizing the probability current can result in large numerical errors if the noise is small. The problem is solved by analytically calculating the mean probability current over the bin size, i.e. the time interval Δt, as follows. We assume that only G(0, t|x, ti−1) varies within the time bin [t′, t′ + Δt], thus
is a constant. In the function G(0, t|x, ti−1), we introduce the notational shortcuts μ0 := μ(t′ | x, ti−1) and μ1 := μ(t′ + Δt | x, ti−1) and consider μ(t | x, ti−1) to vary linearly within the interval [μ0, μ1] and. These approximations allow us to integrate the probability current through the interval [t′, t′ + Δt] analytically, making use of its Gaussian shape. Defining
the mean probability current
can be simplified as
where erf() is the error function.
Using instead of in Equation 28, we can compute the FPTPD accurately for a wide range of noise levels (see Dong et al, in press, for details). Given the FPTPD function p(t) for the ith spike time interval, the likelihood can then be evaluated as fi(ti|tk, k = 1, . . . , i − 1) = p(ti).
The Fokker-Planck equation is a partial differential equation (PDE) which can be solved by finite volume methods. Briefly, state space is divided into discrete cells. Since total probability is conserved, the integrated probability flux through a cell’s surfaces is equal to the probability change inside the cell. The probability density function within each cell as well as its flux through the surfaces between cells are computed. An exponential integration scheme (derived from the exact solution, by exponentially interpolating each sub-region of the grid space) is used for the finite volume method to avoid negative probability densities (Figueiredo, 1997).
Paninski et al. (2008) described a numerical solution of the Volterra equation which yields the gradient of the likelihood function. It requires memory of size L2, where L is the number of time bins. Since we have no need for the gradient, we instead use an algorithm that only requires memory of order L (to store the history of μ(t) and Σ(t)). We also compute μ (Equation 25) and Σ (Equation 26) by solving differential equations, rather than explicitly evaluating exponential functions, see footnote 3.
To increase computation speed, we use an optimization scheme developped in our earlier study (Dong et al, in press). Briefly, time bins with negligible likelihoods can be easily identified if the membrane voltage is either far above or far below the threshold both at the beginning and at the end of the bin; formally, if
where ξ and E were defined in eqs 31, 32 and M is the number of standard deviations beyond which the error function becomes indistinguishable from (positive or negative) unity in the floating number representation chosen. In our case (double precision accuracy), M = 5.9. No computation is required in any time bin satisfying the criteria of equations 35 or 36 since within machine precision this bin does not contribute to the likelihood. For low noise, this drastically decreases the effective number of time bins where the function is evaluated, which considerably decreases memory usage and computation time (see Dong et al, in press, for details).
To determine the local minimum in the case of synthetic spike trains (Section 3.2), we use the r-algorithm with adaptive space dilation (Shor, Kiwiel, & Ruszcayski, 1985), implemented as the ralg algorithm at http://openopt.org. This heuristic algorithm is linearly convergent for dimension K = 2 (this is the number of arguments of the cost function, i.e., the number of model parameters subject to optimization) and it has been conjectured that this is the case for general K (Burke, Lewis, & Overton, 2008). To avoid runaway simulations, we limit the maximal numbers of function evaluation and iterations. In our simulations, the algorithm converged in each of over 1000 runs, i.e.,, the limits were exceeded in none of them (results shown in Fig. 3).
For the model optimization in Section 3.2.2 and for fitting the experimental data in Section 3.3, we use an evolutionary annealing-simplex algorithm, which combines the Nelder-Mead simplex algorithm with simulated annealing and evolutionary techniques (Efstratiadis & Koutsoyiannis, 2002). It starts with a large population of vertices out of which a simplex is formed by random selection. The Nelder-Mead algorithm is then run on this simplex.
We optimized the neuronal model with 11 parameters including some nonlinear terms like adaptation and absolute refractory period. Parameters are listed in Table 2.
In this Section, we compare the performance of the Fokker-Planck equation method with that of the Volterra integral equation method.
The first passage time probability density function of the model neuron is calculated by the Fokker-Planck method (“FP” in the legend of Fig. 1C) and by the Volterra integral method (“Volterra”), both are described in Section 2.4. The “ground truth” is obtained from a Monte Carlo scheme (“MC”) by simulating 1 million instances of the stochastic process S(t), all starting with the same parameters but different random number seeds and computed from eq. 24 with σ = 10−11.
Two types of currents were provided to the model neuron. One is constant (discussed in the next paragraph), the other is time-varying. The latter, shown in 1A, is the initial part of the mixed excitatory and inhibitory current defined in Section 3.2.1 and shown in Fig. 2. For this input current, which vanishes at t = 0, the FPTPDs calculated by all three methods agree (Fig. 1C). It is known that a large ratio of drift vs. diffusion (“Péclet number”) results in large numerical errors (Odman, 1997). Although diffusion is small at t ≈ 0 (see Fig. 1G), the drift is too, as is the Péclet number, and numerical errors are small. The variances of S(t) calculated from Fig. 1E,F agree with the result from Equation 26.
If, on the other hand, a fairly strong constant current is injected from the beginning (Fig. 1B), the Fokker-Planck method is less accurate than the Volterra integral method (Fig. 1D). Figure 1E, F shows the evolution of the probability density function evaluated at the grid points (S, t). As shown in Fig. 1H, the variance of S(t) calculated from the probability density function (Fig. 1F) is larger than the theoretical value (obtained from eq. 26).
The Volterra integral method is not subject to this limitation since it does not require sampling of S. However, it is slower to calculate since its complexity increases quadratically in time4 while the 1-D Fokker-Planck method increases linearly. Because of its superior accuracy, the Volterra integral method will be used in assessing the performance of the maximum likelihood model for the generalized LIF model in sections 3.2 and 3.3.
To validate the method, in Section 3.2.1 we will generate spike trains from the generalized integrate-and-fire neuron model (Mihalas & Niebur, 2009) with known parameters. In Section 3.2.2, we then select random parameter values and attempt to find the correct ones, using the negative log-likelihood for generating the original spike trains as a cost function. We will use both a local and a global optimization algorithm.
The local optimization serves to find, for each initial condition (randomly selected parameters), the local minimum closest to the respective initial values. This is a (somewhat rough) indicator of the usability of the cost function: in general, it is desirable that more rather than less initial conditions are in the basis of attraction of the global minimum, rather than a local, non-global one. Indeed, we will find that the two variations of the method (which will be introduced in Section 3.2.2 and differ in the treatment of noise) that we analyze differ in this respect, with a substantially larger fraction of initial conditions ending up in the global minimum for one vs. the other (about 3/4 vs. 1/2).
In the second set of numerical experiments we will use a global minimization algorithm. We show that it does, indeed, find the exact parameter values in all cases. This validates that the maximum likelihood method is suitable for parameter estimation for the class of problems studied here.
To generate the input current for the numerical experiment, two spike trains 10s length with homogeneous Poisson statistics were generated: one excitatory with mean firing rate 150Hz and one inhibitory with rate 300Hz. The input current to the neuron is then calculated by convolving the Poisson trains (each a sum of delta functions) with an exponentially decaying function of the form
where A is the amplitude of the currents, τ is the decay time constant, and H(t) is the Heaviside step function. For the excitatory input, we choose A = 0.2nA and τ = 5ms, for the inhibitory input, A = −0.04nA and τ = 25ms. The injected input current is the sum of excitatory and inhibitory current inputs. The resulting current (shown in Fig. 2) is injected in the generalized integrate-and-fire neuron (term Isti on the right-hand-side of eq. 1) and the model generates the spike train shown in the Figure. Note that the neuron model used here is deterministic, without any injected noise (σ = 0).
The parameters of the IF model (row ABIF in Table 1) were chosen to make it both adapting and bursting (thus “ABIF”). The membrane resting potential and membrane resetting potential were fixed at −70mV and the equilibrium threshold at −50mV. The threshold rebound time constant were set to 100ms and the absolute refractory period of the neuron to 2ms. Parameters to be determined by the optimization process are capacitance C, membrane conductance g, and adaptation constant a.
Furthermore, two spike induced currents are included in the optimization: E is the amplitude of an excitatory spike-induced current, with τ1 = 5ms, and I that of an inhibitory spike induced current, with τ2 = 25ms. All parameters subject to optimization are shown in Table 1, row ABIF.
An important insight of Paninski (2004) was that the negative log likelihood function of the stochastic LIF neuron is convex, i.e., it only has one (global) minimum. This is not necessarily the case for more realistic models, like that described by eq 2. It is thus important to analyze the cost function (the negative log likelihood) and we find in the remainder of this Section that in general this model has, indeed multiple local minima. We also study two different cost functions, which differ in how stochasticity is treated, and we compare an indicator of the number of local minima in them.
The negative log-likelihood estimator method requires stochasticity of the neuron model, that is, noise needs to be added to the deterministic model. We consider two ways to include noise which give rise to two different likelihood functions (and thus cost functions). The first is to fix the noise amplitude σ (at a low level) and to optimize the model parameters (“fixed noise case”). The second is to formally treat the noise level as a parameter, and to optimize the solution for all parameters, including the noise level (“variable noise case”). Obviously, the true noise level is zero and solutions which yield nonzero results are incorrect (not at the global minimum).
We used the r-algorithm with space dilation5 on 300 randomly selected initial location in the parameter space defined below. For each of these 300 points, we find the closest local minimum of the negative log likelihood as described in the Methods section (2.5). In each dimension, upper and lower boundaries of the search space are defined as ±4 times of the absolute value of the true parameter except (in the variable noise case) the noise level whose true value is zero (see rows 2,3 of Table 1). To assure comparable scaling for the different parameters, we map the parameter space linearly into a multidimensional rectangle with edges [−1, 1] for parameters with both positive and negative values (E, I), and with edges [0, 1] for positive parameters (C, g, a, σ). The initial parameters for the optimization are drawn from a uniform distribution in the multidimensional rectangle space.
As was mentioned above, in the fixed noise case we set the noise level to a small number6 σ = 5.5e−12. Because of the injected white noise, the parameters corresponding to the global minimum differ from the correct values, those used to generate the original spike train (Table 1 row Fixed noise params). Since the noise is optimized in the variable noise case, we can use a larger initial noise level, by choosing it (uniformly) in the range σ [9e − 12, 1.2e − 11] which is mapped in the transformed space to the interval [0.75, 1].
We ran 302 independent optimizations for the fixed noise case and 303 independent optimizations for the variable noise case with the local optimization method. We also ran 434 independent optimizations for the variable noise case with the global optimization method7. As shown in Figure 3A, 49% of the optimizations successfully converged to the global minimum in the fixed noise case. This number was substantially higher, 73.9%, in the variable noise case (Figure 3B). Furthermore, the computation time of the variable noise method was lower than for the fixed noise method. In the fixed noise case, the noise level is low but not negligible (σ = 5.5e−12), and the number L of time bins for the function evaluation is essentially constant. In the variable noise case, the optimization starts with a large noise level and consequently runs slowly in the beginning. Over time, the true noise level (zero) is approached and the contribution to the total likelihood of more and more bins becomes negligible. These bins are then eliminated from the evaluation, as discussed in Section 2.4, which increases the computation speed dramatically since the complexity of the algorithm is quadratic in the number of bins, see Footnote 4. As a consequence, we find that, in practice, variable noise optimization is an order of magnitude faster than fixed noise optimization.
We also find that fixing the noise level leads to systematic errors in the parameter estimation, as shown in Table 1, row Fixed Noise params. Furthermore, we find that in the example chosen, more points are trapped in a prominent local minimum in the fixed noise case than in the variable noise case, presumably due to the fact that the initial values of the likelihood for some of the parameters is too small to allow computation of their correct values (see footnote 6). This problem is alleviated in the variable noise case by starting with a larger noise level. The initial likelihood values are therefore larger and the finite gradient of the likelihood function then allows accurate computation of all parameters.
We note in Figure 3 that in rare cases (five for fixed noise, one for variable noise) the negative likelihood actuallynnn increases during the optimization. These failures of the r-algorithm are due to the fact that it is a sub-gradient optimization method (Shor et al., 1985) which allows minimization of non-differential functions but, given that the gradient of such a function not always exists, does not guarantee that the solution always descends.
Having validated the model with synthetic test data generated from a model where we know the correct parameter values, we now apply the method to the estimation of model parameters for fitting experimental (biological) data.
The experimental data were taken from the Spike Prediction Competition 2009 Challenge B (Gerstner & Naud, 2009). An L5 fast-spiking cell was recorded with in-vivo-like current injection. We used data from eight trials of 14s length each for the training of the model and another 8s for the prediction. We used the same generalized integrate-and-fire model as before and optimized 11 parameters, including the amplitudes of four spike-induced currents amplitude (their decay constants are kept fixed at 5ms,10ms,20ms,50ms), absolute refractory period, noise level, threshold adaption, threshold rebound, membrane time constant, voltage resetting and equilibrium membrane voltage. The boundaries for these 11 parameters are shown in Table 2 row Lower and Upper. The final optimized parameters are shown in row Optim.
Training and prediction results are shown in Figure 4 A,B. The bench-mark proposed for the quality of the training and prediction by Jolivet et al. (2008) is the average spike-time coincidence (Г− factor) between predicted spikes and measured spikes across 8 repetitions, scaled by the intrinsic reliability of spiking across several trials. On the training data, the model explained 68.4% of the spikes, corresponding to a normalized performance by the Jolivet et al. (2008) measure of 98.3%, with a confidence level (computed by bootstrap) of ±4.7%. For prediction, our model explained 66.3% of the spikes, corresponding to a performance of 93.6%, confidence level ±1.59%.
Note that we used the data from the spike prediction competition only as a conveniently available spiketrain data set. A direct comparison of the performance of our method with others used in the competition is not possible since we only use a small part of the training data, and we use it for both training and prediction (since the “prediction data” has been published, a genuine prediction is no longer possible). Nevertheless, we note that although our prediction results are lower than the winners of the competition, they are better than the majority of submitted results (7/11). Importantly, we optimize the neuronal model based on the likelihood function for the spike trains, without making any use of the definition of the Г 2212 factor while the entries to the competition were tailored to achieve maximum performance under this measure.
For neuronal model optimization, the key question is the choice of the cost function. One possibility is to use the model for generating spike trains (which, of course, depend on the model parameters) and then compute the “distance” between these and the observed spike using one of a variety of proposed a spike train distance measures as cost function. However, there are two drawbacks of using spike train distance measures. The first is that their cost function in general has a lot of local minima and therefore a stochastic global optimization method has to be used, which is time consuming. The second is that spike train distance measures are not parameter free. The popular spike distance measures all have at least one freely chosen parameter (Victor & Purpura, 1996; van Rossum, 2001; Jolivet et al., 2008). The choice of the free parameter could be made part of the optimization problem but this increases the dimension of the problem even further.
Maximum likelihood estimation solves both of these problems. The likelihood function is a natural measure of distance which is also parameter free. Furthermore, Paninski (2004) showed that the negative log likelihood function of the leaky integrate-and-fire neuron is convex. This guarantees the existence of a unique global minimum and also allows to find it using simple gradient-descent methods. Unfortunately, the dynamics of the straightforward LIF model are not rich enough to explain the behavior of many biological neurons. The generalized LIF model (Mihalas & Niebur, 2009) generates a much richer dynamics but the proof of convexity is not applicable to it. Indeed, our numerical experiments show that the likelihood function is encumbered with local minima; in our somewhat crude estimation, about 1/4 to 1/2 (depending on how stochasticity was treated, see next Section) of initial conditions were closer to local minima rather than the global minimum. However, the global minimum is easily attained with a global optimization algorithm (Fig. 3C,F).
Finding the minimum of the negative log likelihood function requires that the neuron model is stochastic, i.e., the presence of non-zero noise. It is a matter of practicality how this noise is treated. One possibility is to fix the noise level to a small finite number and find the minimum negative log-likelihood for the model parameters (in our case, those in row ABIF in Table 1). Alternatively, the noise can be introduced as an additional parameter that is itself subject to optimization. Although one might expect that increasing the dimensionality of the optimization problem (in our case, from five to six) would substantially decrease numerical efficiency, we found the opposite: optimizing the noise level simultaneously with the model parameters not only eliminates the need for choosing a somewhat arbitrary parameter but it actually increases performance.
The reason for this seemingly counter-intuitive result is that choice of the noise level is non-trivial (indeed, the optimal level may vary during the course of the optimization process). If the noise level is chosen too small, the optimization is likely to stay in local minima. If it is set too large, estimated parameters will have a bias (see Table 1). The better solution is thus to start with a large noise level and let the optimization process find adaptively the optimal level. Our results show that this procedure finds the global minimum substantially more frequently than if noise is chosen fixed. Furthermore, due to the speed optimization introduced earlier (Dong et al, in press), we find that the adaptive noise level algorithm also runs considerably faster than if a fixed level of noise is chosen.
Two evaluation methods for the likelihood function are compared in this study: Fokker-Planck equations and Volterra integral equations. The Fokker-Planck method uses a PDE solver to numerically solve the Fokker-Planck equation for the probability density of the time-dependent state variables (voltage and threshold). It’s one-dimensional version (where the threshold is considered a deterministic, non-stochastic variable) has the advantage of fast calculation and the complexity is O(n). A weakness is that it suffers from numerical diffusion problems and, as we show, it is less accurate than the solution of the Volterra integral equation. These problems are in our experience alleviated if interspike-intervals are long compared to the membrane time constant in which case the numerical diffusion is frequently small compared to actual diffusion caused by noisy current input.
The Volterra integral method consists of computing numerically the integral in a Volterra integral equation (of the second kind). It only requires discretization of time variable. We find that this method is more accurate in the calculation of the likelihood function than the solutions of the Fokker-Planck equation, however its complexity grows quadratically with the number of time bins, i.e., the length of the interspike intervals.
These complementary properties of the two methods point to an obvious solution, by combining them: For long interspike intervals, the Fokker-Planck method is fast and its weakness (numerical diffusion) is masked by actual diffusion (at least for sufficiently complex input, which is usually found in biologically relevant circumstances). For short interspike intervals, the Volterra integral method is accurate and its weakness (quadratic complexity in number of time bins) is of little importance. Thus, by using the Fokker-Planck equation method for long inter-spike intervals and the Volterra Integral equation method for short intervals, a fast and accurate solution is obtained.
Quantitative studies of biological nervous systems characterize their behavior in terms of analytical models whose parameters need to be fit to the available data. This requires to choose a model that is, on the one hand, complex enough to capture the essential properties of the biological neurons, and on the other hand accessible for efficient and accurate parameter fitting. One example of a successful compromise is the leaky integrate-and-fire model, in particular since Paninski et al. (2008) showed that its maximum likelihood for generating a specific spike train can be found by gradient ascent. The dynamics of this model are, however, too simple for many biological neurons. A generalized leaky integrate-and-fire model (Mihalas & Niebur, 2009) provides much richer dynamics and we show in this contribution that, although simple gradient ascent techniques will not work, the maximum likelihood function of this model can be found readily and efficiently. We expect that this will allow the wide-spread use of the powerful maximum-likelihood methods to this important interface of experimental and theoretical neuroscience.
Supported by NIH grants NEI R01EY016281 and NINDS 5R01NS40596 and the Office of Naval Research grant N000141010278.
The Fokker-Planck Equation 12 can be understood as a continuity equation
for the two-dimensional probability current J. It has two components (JS, JΘ),
Before calculating the boundary condition, we first try to find the conditional probability distribution of Θ, conditioned on S(t) < 0, t < ti and S(ti) = 0. Intuitively, one might think it is JS(S = 0,Θ, ti), which is the probability of P going beyond S = 0 per unit time and per unit Θ at time ti and threshold Θ. However, due to the absorbing boundary condition, the probability density vanishes at the threshold, P(0,Θ, t) = 0. From Eq. 40, we find JS(S = 0,Θ, ti) = 0. Next, we consider the probability current at S = −ΔS, which is the probability of P going beyond S = −ΔS per unit time and per unit Θ at time ti. This current will push up the probability density P in the range S = [−ΔS, 0]. Due to the absorbing boundary condition, we only consider the absorbed P as “firing” probability. Absorbing happens at the point S = 0 (not in an interval). To calculate the absorbed part of the probability density P at the single point S = 0, the absorbed probability can be approximated by
if ΔS is very small. Then Eq. 41 is the conditional probability of Θ.
Considering JS(S = 0,Θ, ti) = 0, Eq. 41 can be expressed in another form:
In the limit ΔS → 0, the difference in Eq. 42 becomes the derivative,
Equation 43 is not really a density distribution function, it may contain negative terms and it is not normalized. Defining R as the usual linear rectifier:
we obtain the conditional density distribution function for Θ:
where Z is the normalization factor
It is easy to show that the probability S(t) < 0,t < ti and S(ti) = 0 (which is the conditioned likelihood as discussed below) is this normalization factor Z.
Having obtained the conditional probability distribution function for Θ, computing the resetting boundary condition now simply involves mapping the density function onto the line S = Vreset − Θ in 2-D space (S, Θ)
where δ() is the Dirac Delta function and , approach the spike times ti from above and below, respectively.
We find from Equation 1 that the noise of the process V (t) is
From Eq.49, we see that the noise of S(t) is
Changing the order of integrations in the first term in the Eq. 50,
Equation 50 can be simplified as :
For noise only appearing in the V (t) term, we find from Equation 24 that the noise of the process S(t) is,
Defining and assuming τ1 < τ2, we can calculate the covariance of Cov (NS(τ1),NS(τ2)):
Then the variance of NS(t) is computed as:
If t is large compared to 1/g, 1/(b + g), 1/b, the variance of NS(t) saturates at
Taylor expansion of Eq. 25 around ti−1 up to 2ed order yields,
where Δt = t − ti−1 and, from the Taylor expansion of Eq. 26 up to 2ed order:
The transient probability G in Eq. 27 has a singularity in the limit t → ti−1; it is 0 if y ≠ x and ∞ if y = x, which is a Dirac function:
The FPTPD p(t) can be computed from a Volterra integral equation of the second kind:
where , is the time derivative of the cumulative distribution of the random variable S(t) and S0 = S(0) is the initial condition of S(t). The time derivative of μ(t | x, ti−1) is
and the time derivative of Σ2(t|ti-1) is
Defining , we then obtain ψ as
Since G(0, t | x, ti−1) = δ(x − 0) when t → ti−1, G diverges at x = 0. To remove the singularity, we use the following transformation, as proposed by Buonocore et al. (1987):
When x = 0 and t → ti−1, (0, t | x, ti−1) is set to be 0. λ(t) is obtained as
We thus find the probability current with singularity removed as
1Paninski et al. (2004) proposed a third method, directly evaluating the Gaussian integrals occurring in the likelihood functions. This method, however, only provides a crude approximation to the likelihood and furthermore it is not efficient to calculate. We will not discuss it any further in this report.
3The numerically costly evaluation of the exponential function in equation 26 can be avoided by transforming it into a differential equation,
4The integration in eq. 28 takes i steps in the i-th iteration, therefore computation to a time of order n takes ≈ (1 + n)n/2 steps, which is O(n2).
5A simple gradient descent method will run into problems because the likelihood function is not everywhere differentiable, e.g. when the threshold is crossed.
6We cannot choose this too small, otherwise most of the small likelihood range for random initial parameters is beyond the range of double precision calculations. During the numerical computation, very small probabilities are bounded by the smallest possible number that can be represented in double precision. Thus, if the initial conditions are too far away from the correct values, the gradient of the negative log likelihood vanishes and the optimization cannot find the minimum. This problem is alleviated in the variable noise case, where we can choose a larger initial noise level since the optimization converges on the correct (zero) level. Furthermore, computation of the likelihood involves the subtraction of two error functions which can lead to truncation errors (see Dong et al, In Press).
7The exact number of simulations is of no importance; we stopped when we were convinced that we had characterized the methods in a satisfactory manner.