Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3044782

Formats

Article sections

- Abstract
- I. Introduction
- II. Point Process Generalized Linear Model
- III. Maximum Likelihood Estimation and Regularization
- IV. Bayesian Inference and Variational Bayes Method
- V. Results
- VI. Discussion and Conclusion
- References

Authors

Related links

IEEE Trans Neural Syst Rehabil Eng. Author manuscript; available in PMC 2012 April 1.

Published in final edited form as:

Published online 2010 October 11. doi: 10.1109/TNSRE.2010.2086079

PMCID: PMC3044782

NIHMSID: NIHMS247752

Zhe Chen, Senior Member, IEEE,^{} David F. Putrino, Soumya Ghosh, Riccardo Barbieri, Senior Member, IEEE, and Emery N. Brown, Fellow, IEEE

Zhe Chen, Neuroscience Statistics Research Laboratory, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA, and also with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA;

Zhe Chen: ude.tim.tatsoruen@nehcehz; David F. Putrino: ude.tim.tatsoruen@10donirt; Soumya Ghosh: ua.ude.awu.enellyc@hsohgs; Riccardo Barbieri: ude.tim.tatsoruen@ireibrab; Emery N. Brown: ude.tim.tatsoruen@bne

The publisher's final edited version of this article is available at IEEE Trans Neural Syst Rehabil Eng

See other articles in PMC that cite the published article.

The ability to accurately infer functional connectivity between ensemble neurons using experimentally acquired spike train data is currently an important research objective in computational neuroscience. Point process generalized linear models and maximum likelihood estimation have been proposed as effective methods for the identification of spiking dependency between neurons. However, unfavorable experimental conditions occasionally results in insufficient data collection due to factors such as low neuronal firing rates or brief recording periods, and in these cases, the standard maximum likelihood estimate becomes unreliable. The present studies compares the performance of different statistical inference procedures when applied to the estimation of functional connectivity in neuronal assemblies with sparse spiking data. Four inference methods were compared: maximum likelihood estimation, penalized maximum likelihood estimation, using either _{2} or _{1} regularization, and hierarchical Bayesian estimation based on a variational Bayes algorithm. Algorithmic performances were compared using well-established goodness-of-fit measures in benchmark simulation studies, and the hierarchical Bayesian approach performed favorably when compared with the other algorithms, and this approach was then successfully applied to real spiking data recorded from the cat motor cortex. The identification of spiking dependencies in physiologically acquired data was encouraging, since their sparse nature would have previously precluded them from successful analysis using traditional methods.

Identifying the functional connectivity of a neuronal system using simultaneously recorded neural spike trains has provided valuable implications for understanding the system from a statistical perspective [6], [25]. Statistical modeling of neuronal data has been used for establishing statistical associations or causality between neurons, finding spatiotemporal correlations, or studying the functional connectivity in neuronal networks [10], [3], [50], [32], [31], [41], [14]. This analysis has many functional applications such as neural decoding, and assisting attempts to understand the collective dynamics of coordinated spiking cortical networks [49]. A statistical tool for analyzing multiple spike trains is the theory of random point processes. Statistical inference for point process observations often starts with a certain class of statistical (parametric or nonparametric) model, followed by parameter estimation using a statistical (either maximum likelihood or Bayesian) inference procedure [42], [5], [47]. To date, a number of statistical tools and models have been used to identify functional connectivity between ensemble neurons. The cross-correlogram and joint peri-stimulus time histogram (JPSTH) are standard (and possibly simplest) nonparametric methods for analyzing the interactions between pairwise neurons [35], [20], [1]. However, these tools have serious drawbacks: correlation-based analysis is limited to second-order spike count statistics, which is inadequate for neuronal spike trains. Further, these methods are nonparametric and there is no model validation or goodness-of-fit tests for the data. Recently, point process generalized linear models (GLMs) have been widely used for characterizing functional (spiking) dependence among ensemble neurons [10], [32], [47]. Specifically, the spiking probability of a particular neuron may be modeled as a function of the spiking history of concurrently recorded ensemble neurons (and possibly, a function of the input of the other stimuli as well), and the corresponding parameters of the point process GLM are inferred by maximum likelihood estimation. Bayesian inference has also been recently proposed for GLM inference on neural spike train data [36], [39], [48], [21], [42], [9]. Various approximation procedures have been developed, based on either Laplace approximation, expectation propagation (EP), Markov chain Monte Carlo (MCMC) sampling, or variational approximation.

To date, spike train modeling based upon the point process GLM has been used as a statistical tool in an effective manner to infer functional connections between groups of simultaneously recorded neurons [32]. However, physiological experiments that aim to record neural activity from awake, behaving animals are technically difficult to perform, and at times, the quantities of data that are collected during recording periods can be less than ideal. In addition, neurons that are recorded from animals or human subjects in anesthetized state often fire at very low spiking rates. Most traditional methods that analyze pairwise relationships between neurons simply cannot be used to reliably infer interactions when neural firing rates are low, or the number of trials of a behavioral task that is being studied are low. This presents a rather important problem in the field of neuroscience, as many sets of carefully acquired experimental data are then considered unusable for analysis (i.e., because insufficient trials of a difficult behavioral task were acquired, or recorded neurons are firing at an extremely low rate). Theoretically, the more recently developed model-based methods have the ability to reliably perform this analysis, but their effectiveness at being applied to this problem has not been examined. The present paper evaluates the ability of two different approaches to address this problem based on the point process GLM framework: the first one is penalized maximum likelihood estimation that uses _{2} or _{1} regularization, which aims to improve the generalization of the model while reducing the variance of the estimate; the second is hierarchical Bayesian estimation, which uses an efficient variational approximation technique that allows deterministic inference (without resorting to random MCMC sampling). The statistical algorithms under investigation are all capable of handling large-scale problems, but the present paper focuses on relatively small-scale data sets. The current paper focuses on the investigation of different inference algorithms rather than on different modeling paradigms for characterizing functional connectivity. It is worth noting that, in addition to the point process GLMs, other statistical models such as the maximum entropy model [38], [40] and the dynamical Bayesian network [15], are also useful complementary tools for inferring the spiking dependence among ensemble neurons.

A point process is a stochastic process with 0 and 1 observations [5], [12]. Let *c* = 1, ···*, C* denote the index of a multivariate (*C*-dimensional) point process. For the *c*th point process, let
${\mathit{y}}_{1:T}^{c}=({y}_{1}^{c},\dots ,{y}_{T}^{c})$ denote the observed response variables during a (discretized) time interval [1, *T*], where
${y}_{t}^{c}$ is an indicator variable that equals to 1 if there is a spike at time *t* and 0 otherwise. Therefore, multiple neural spike train data are completely characterized by a multivariate point process
${\{{\mathit{y}}_{1:T}^{c}\}}_{c=1}^{C}$. Mathematical backgrounds on the point process theory can be found in [12], [7].

In the framework of GLM [29], we assume that the observations {*y*_{1:}* _{T}*} follow an exponential family distribution with the form:

$${p}_{\theta}({y}_{t}\mid {\theta}_{t})=exp({y}_{t}{\theta}_{t}-b({\theta}_{t})+c({y}_{t})),$$

(1)

where *θ* denotes the canonical parameter, and *c*(*y _{t}*) is a normalizing constant. Assume that

$$g({\mu}_{t})={\eta}_{t}=\mathit{\beta}{\mathit{x}}_{t}$$

(2)

where *x** _{t}* denotes the input covariate at time

To model the spike train point process data, we use the following logistic regression model with the logit link function.^{1} Specifically, let *c* be the index of target neuron, and let *i* = 1, …*C* be the indices of trigger neurons. The Bernoulli (binomial) logistic regression GLM is written as:

$$\text{logit}({\pi}_{t})={\mathit{\beta}}_{c}{\mathit{x}}_{t}=\sum _{j=0}^{d}{\beta}_{j}^{c}{x}_{j,t}={\beta}_{0}^{c}+\sum _{i=1}^{C}\sum _{k=1}^{K}{\beta}_{i,k}^{c}{x}_{i,t-k}$$

(3)

where *dim*(*β** _{c}*) =

$${\pi}_{t}=\frac{exp({\mathit{\beta}}_{c}{\mathit{x}}_{t})}{1+exp({\mathit{\beta}}_{c}{\mathit{x}}_{t})}=\frac{exp({\beta}_{0}^{c}+{\sum}_{j=1}^{d}{\beta}_{j}^{c}{x}_{j,t})}{1+exp({\beta}_{0}^{c}+{\sum}_{j=1}^{d}{\beta}_{j}^{c}{x}_{j,t})}$$

(4)

which yields the probability of a spiking event at time *t*. It is seen from (4) that the spiking probability *π _{t}* is a logistic sigmoid function of

Equation (3) essentially defines a spiking probability model for neuron *c* based on its own spiking history, and that of the other neurons in the ensemble. It has been shown that such a simple spiking model is a powerful tool for the inference of functional connectivity between ensemble neurons [32], and in predicting single neuronal spikes based on collective population neuronal dynamics [49]. Here,
$exp({\beta}_{0}^{c})$ can be interpreted as the baseline firing probability of neuron *c*. Depending on the algebraic (positive or negative) sign of coefficient
${\beta}_{i,k}^{c},exp({\beta}_{i,k}^{c})$ can be viewed as a “gain” factor (dimensionless, > 1 or < 1) that influences the current firing probability of neuron *c* from another neuron *i* at the previous *k*th time lag. Therefore, a negative value of
${\beta}_{i,k}^{c}$ will strengthen the inhibitory effect and move *π _{t}* towards the negative side of the hyperplance; a positive value of
${\beta}_{i,k}^{c}$ will enhance the excitatory effect, and thereby moving

For the *c*th spike train point process data, we can write down the log-likelihood function:

$$L({\mathit{\beta}}_{c})=\sum _{t=1}^{T}\left[{y}_{t}^{c}log{\pi}_{t}({\mathit{\beta}}_{c})+(1-{y}_{t}^{c})log{\pi}_{t}(1-{\mathit{\beta}}_{c})\right]$$

(5)

Let ** θ** = {

$$L(\mathit{\theta})=\sum _{c=1}^{C}L({\mathit{\beta}}_{c}).$$

(6)

Note that the index *c* is uncoupled from each other in the network log-likelihood function, which implies that we can optimize the function *L*(*β** _{c}*) separately for individual spike train observations
${\mathit{y}}_{1:T}^{c}$. For simplicity, from now on we will drop off the index

The objective of the standard maximum likelihood estimation is to maximize (6) given all spike train data. It is known that when the data sample is sufficiently large, the maximum likelihood estimate (m.l.e.) is asymptotically unbiased, consistent, and efficient. However, when the number of samples is small, or the neural spiking rate is small (i.e., the number of ‘1’s in the observation ** y** is sparse), many empirical observations have indicated that m.l.e. produces either wrong or unreliable estimates. The error in the m.l.e. is typically related to two factors: bias and variance, and thus the ultimate goal of statistical estimation is to produce an unbiased minimum variance estimate. The issue of bias and large variance becomes more severe when a data set with a small sample size is encountered, and the size of the parameter space is relatively large. One way to reduce variance is through regularization, which aims to improve the generalization ability of the model (on new data) while fitting finite observed training data. The idea of regularization is to impose certain prior knowledge (such as sparsity) or physiologically plausible constraint (such as temporal smoothness) on the parameters [46], [23]. Furthermore, regularization can be interpreted as imposing a prior on the parameter space from an empirical Bayesian perspective, and the penalized log-likelihood will be interpreted as the log posterior density of the parameters [42], [8]. Therefore, penalized maximum likelihood estimation seeks to maximize a regularized log-likelihood function, which consists of a log-likelihood function plus a penalty function weighted by a regularization parameter. The resultant penalized m.l.e. can be viewed as a

First, let us consider the following penalized log-likelihood function using _{2}-regularization:

$${L}_{2}(\mathit{\beta})=L(\mathit{\beta})-\rho {\mathit{\beta}}^{\top}\mathit{Q}\mathit{\beta}$$

(7)

where *ρ* > 0 denotes the regularization parameter, and ** Q** denotes a user-defined positive semidefinite matrix. The use of the quadratic term

$${L}_{2}(\mathit{\beta})=L(\mathit{\beta})-\rho {\left|\right|\mathit{\beta}\left|\right|}_{2}.$$

(8)

Note that equations (7) and (8) are concave function of the parameter vector ** β**, and minimizing the negative penalized log-likelihood estimation is a convex optimization problem.

Once the regularization parameter *ρ* is determined (e.g., via cross-validation or regularization path), the optimization problem reduces to maximize a concave function of ** β**. A standard approach to minimize a convex function is through the Newton method. Specifically, let

$$\begin{array}{l}{\mathit{\beta}}_{n+1}={\mathit{\beta}}_{n}-{\mathbf{H}}^{-1}({\mathit{\beta}}_{n})\mathbf{g}({\mathit{\beta}}_{n})\\ ={\mathit{\beta}}_{n}+{[{\mathit{X}}^{\top}\mathit{W}({\mathit{\beta}}_{n})\mathit{X}+\rho \mathit{Q}]}^{-1}{\mathit{X}}^{\top}(\mathit{y}-\widehat{\mathit{y}}({\mathit{\beta}}_{n}))\end{array}$$

(9)

where ** ŷ**(

$$[{\mathit{X}}^{\top}\mathit{W}({\mathit{\beta}}_{n})\mathit{X}+\rho \mathit{Q}]{\mathit{\beta}}_{n+1}={\mathit{X}}^{\top}\mathit{W}({\mathit{\beta}}_{n})\mathit{b},$$

(10)

where ** b** =

Another popular regularization scheme is through _{1}-regularization, which penalizes the _{1} norm of the solution [44]. Unlike _{2} regularization, _{1} regularization favors the sparse solution (i.e., many coefficients in ** are zeros). From a decision-theoretic perspective, **_{2}-norm is a result of penalizing the mean of a Gaussian prior of the unknown variables, while an _{1} norm penalizes the median of a Laplace prior, which has heavier tails in its distribution shape. Specifically, the penalized log-likelihood function with _{1} norm regularization is written as

$${L}_{1}(\mathit{\beta})=L(\mathit{\beta})-\rho {\left|\right|\mathit{\beta}\left|\right|}_{1}$$

(11)

which is a concave function of ** β**, but is not twice differentiable with respect to

The interior-point method for maximizing *L*_{1}(** β**) in (11) aims to solve an equivalent optimization problem [27]:

$$\begin{array}{l}\text{minimize}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}-L(\mathit{\beta})+\rho {\mathbf{1}}^{\top}\mathit{u}\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}-{u}_{j}\le {\beta}_{j}\le {u}_{j},\phantom{\rule{0.38889em}{0ex}}j=1,\dots ,d,\end{array}$$

with variables *u** ^{d}*. The logarithmic barrier for the bound constraints

$$\mathrm{\Phi}(\mathit{\beta},\mathit{u})=-\sum _{j=1}^{d}log({u}_{j}^{2}-{\beta}_{j}^{2}),$$

(12)

with domain domΦ = {(*β**,* ** u**)

$$E(\mathit{\beta},\mathit{u})=-\kappa L(\mathit{\beta})+\kappa \rho {\mathbf{1}}^{\top}\mathit{u}+\mathrm{\Phi}(\mathit{\beta},\mathit{u})$$

(13)

where *κ* > 0 is a scalar parameter that defines the central path of a curve of *E*(*β**,* ** u**). The new function defined in (13) is smooth and strictly convex, and it can be optimized using the Newton or CG method. Increasing the values of

Upon completing the standard or penalized likelihood inference, we obtain the parameter estimate **. Let**

$$\mathit{\sum}\approx I{(\mathit{\beta})}^{-1}=-\mathbb{E}{\left[\frac{{\partial}^{2}L}{\partial \widehat{\mathit{\beta}}\partial {\widehat{\mathit{\beta}}}^{\top}}\right]}^{-1}$$

(14)

denote the inverse of the negative Hessian matrix of the log-likelihood estimated from the ensemble samples. From the property of m.l.e. it is known that **Σ** approximates the inverse of the Fisher information matrix *I*(** β**); in addition, under the regularity condition and large sample assumption, the m.l.e.

To estimate the matrix **Σ** in (14), in the case of standard maximum likelihood estimation for the GLM, upon convergence we can derive that **Σ** = (*X*^{}** W**(

$$\mathbf{H}=\left[\begin{array}{ccc}\kappa {\mathbf{1}}^{\top}\mathit{W}\mathbf{1}& \kappa {\mathbf{1}}^{\top}\mathit{W}{\mathit{X}}_{-}& \mathbf{0}\\ \kappa {\mathit{X}}_{-}^{\top}\mathit{W}\mathbf{1}& \kappa {\mathit{X}}_{-}^{\top}\mathit{W}{\mathit{X}}_{-}+{\mathit{D}}_{1}& {\mathit{D}}_{2}\\ \mathbf{0}& {\mathit{D}}_{2}& {\mathit{D}}_{1}\end{array}\right]\in {\mathbb{R}}^{(2d+1)\times (2d+1)}$$

(15)

where ${\mathit{D}}_{1}=\text{diag}\{{\scriptstyle \frac{2({u}_{1}^{2}+{\beta}_{1}^{2})}{{({u}_{1}^{2}-{\beta}_{1}^{2})}^{2}}},\dots ,{\scriptstyle \frac{2({u}_{d}^{2}+{\beta}_{d}^{2})}{{({u}_{d}^{2}-{\beta}_{d}^{2})}^{2}}}\}$, and ${\mathit{D}}_{2}=\text{diag}\{{\scriptstyle \frac{-4{u}_{1}{\beta}_{1}}{{({u}_{1}^{2}-{\beta}_{1}^{2})}^{2}}},\dots ,{\scriptstyle \frac{-4{u}_{d}{\beta}_{d}}{{({u}_{d}^{2}-{\beta}_{d}^{2})}^{2}}}\}$. In light of the Schur complement, we obtain

$$\mathit{\sum}={\left[\begin{array}{cc}\kappa {\mathbf{1}}^{\top}\mathit{W}\mathbf{1}& \kappa {\mathbf{1}}^{\top}\mathit{W}{\mathit{X}}_{-}\\ \kappa {\mathit{X}}_{-}^{\top}\mathit{W}\mathbf{1}& \kappa {\mathit{X}}_{-}^{\top}\mathit{W}{\mathit{X}}_{-}+{\mathit{D}}_{3}\end{array}\right]}^{-1}$$

(16)

where ${\mathit{D}}_{3}={\mathit{D}}_{1}-{\mathit{D}}_{2}{\mathit{D}}_{1}^{-1}{\mathit{D}}_{2}$.

To quantify the connectivity among *C* neurons, from (3) we define the mean connectivity ratio as follows:

$$\mathit{ratio}=\frac{1}{K}\sum _{c=1}^{C}\sum _{i=1}^{C}\sum _{k=1}^{K}\frac{\#\{\mid {\beta}_{i,k}^{c}\mid \phantom{\rule{0.16667em}{0ex}}\gg 0\}}{C(C-1)}.$$

(17)

where
$\#\{\mid {\beta}_{i,k}^{c}\mid \phantom{\rule{0.16667em}{0ex}}\gg 0\}$ denotes that the number of the coefficient
${\beta}_{i,k}^{c}$ whose statistical estimates are significantly nonzero. Note that the spiking dependence is directional and asymmetric in our statistical model, the spiking dependence between A→B and B→A is not necessarily the same. Therefore, for a total of *C* ensemble neurons, there are possibly (*C*^{2} −*C*) directions (excluding *C* self-connection coefficients) between all neuron pairs.

In addition to the maximum likelihood estimation, another appealing statistical inference tool is Bayesian estimation. The goal of Bayesian inference is to estimate the parameter posterior *p*(** β**|

Specifically, let ** α** denote the hyperparameter set, and we can derive

$$\begin{array}{l}logp(\mathit{y}\mid \mathit{x})=log\int \int p(\mathit{y}\mid \mathit{x},\mathit{\beta})p(\mathit{\beta}\mid \mathit{\alpha})p(\mathit{\alpha})\text{d}\mathit{\beta}\text{d}\mathit{\alpha}\\ \ge \int \int q(\mathit{\beta},\mathit{\alpha})log\frac{p(\mathit{y}\mid \mathit{x},\mathit{\beta})p(\mathit{\beta}\mid \mathit{\alpha})p(\mathit{\alpha})}{q(\mathit{\beta},\mathit{\alpha})}\text{d}\mathit{\beta}\text{d}\mathit{\alpha}\equiv \stackrel{\sim}{L},\end{array}$$

(18)

where *p*(** β**|

A variational approximation algorithm for logistic regression has been developed in the field of machine learning [24], and it can be easily extended to the Bayesian setting [2]. The basic idea of variational approximation is to derive a variational lower bound for the marginal log-likelihood function. However, the hyperparameters used in [24] are fixed a priori, so their model is empirical Bayesian. Here, we extend the model with hierarchical Bayesian modeling using *automatic relevance determination* (ARD) [30] for the purpose of variable selection. Such a fully Bayesian inference integrated with ARD allows us to design a separate prior for each element *β _{j}* in the vector

$$\begin{array}{l}p(\mathit{\beta}\mid \mathit{\alpha})\sim \mathcal{N}(\mathit{\beta}\mid {\mathit{\mu}}_{0},{\mathit{A}}^{-1})\propto exp\left(-\frac{1}{2}{(\mathit{\beta}-{\mathit{\mu}}_{0})}^{\top}\mathit{A}(\mathit{\beta}-{\mathit{\mu}}_{0})\right),\\ p(\mathit{\alpha})=\prod _{j=0}^{d}\text{Gamma}({\alpha}_{j}\mid {a}_{0},{b}_{0}),\end{array}$$

where ** A** = diag{

Let ** ξ** = {

$$\begin{array}{l}logq(\mathit{\beta}\mid \mathit{y})=log\stackrel{\sim}{p}(\mathit{\beta},\mathit{\xi})+{\mathbb{E}}_{q(\mathit{\alpha})}[logp(\mathit{\beta}\mid \mathit{\alpha})]\\ =log\mathcal{N}(\mathit{\beta}\mid {\mathit{\mu}}_{T},{\mathit{\sum}}_{T}),\end{array}$$

(19)

$$\begin{array}{l}logq(\mathit{\alpha}\mid \mathit{y})={\mathbb{E}}_{q(\mathit{\beta})}[logp(\mathit{\beta}\mid \mathit{\alpha})]+logp(\mathit{\alpha})\\ =log\left\{\prod _{j=0}^{d}\text{Gamma}({\alpha}_{j}\mid {a}_{T},{b}_{j,T})\right\},\end{array}$$

(20)

which follow from updates from conjugate priors and posteriors for the exponential family (Gaussian and Gamma distributions). The term (*β**,* ** ξ**) appearing in (19) denotes the variational likelihood bound for logistic regression:

$$\begin{array}{l}logp(\mathit{\beta},\mathit{\xi})\ge log\stackrel{\sim}{p}(\mathit{\beta},\mathit{\xi})\\ =\sum _{t=1}^{T}\left(log\sigma ({\xi}_{t})-\frac{{\xi}_{t}}{2}+\phi ({\xi}_{t}){\xi}_{t}^{2}\right)-[{\mathit{\beta}}^{\top}(\phi ({\xi}_{t}){\mathit{x}}_{t}{\mathit{x}}_{t}^{\top})\mathit{\beta}]+{\mathit{\beta}}^{\top}\left({\mathit{x}}_{t}{y}_{t}-\frac{1}{2}{\mathit{x}}_{t}\right),\end{array}$$

(21)

where *σ*(·) is a logistic sigmoid function. The variational likelihood bound at the right-hand side of (21) has a quadratic form in terms of ** β**, and therefore it can be approximated by a Gaussian likelihood (which results in equation 19). The other terms appearing in (19) through (21) are defined by

$$\begin{array}{l}{\xi}_{t}=\sqrt{{\mathit{x}}_{t}^{\top}({\mathit{\sum}}_{T}+{\mathit{\mu}}_{T}{\mathit{\mu}}_{T}^{\top}){\mathit{x}}_{t}}\\ \phi ({\xi}_{t})=\frac{tanh({\xi}_{t}/2)}{4{\xi}_{t}}\\ {\mathit{\sum}}_{T}^{-1}={\mathbb{E}}_{q(\mathit{\alpha})}[\mathit{A}]+2\sum _{t=1}^{T}\phi ({\xi}_{t}){\mathit{x}}_{t}{\mathit{x}}_{t}^{\top}\\ {\mathit{\mu}}_{T}={\mathit{\sum}}_{T}\left({\mathbb{E}}_{q(\mathit{\alpha})}[\mathit{A}]{\mathit{\mu}}_{0}+\sum _{t=1}^{T}({y}_{t}-0.5){\mathit{x}}_{t}\right)\\ {\mathbb{E}}_{q(\mathit{\alpha})}[\mathit{A}]=\text{diag}\{{a}_{T}/{b}_{j,T}\}\equiv {\mathit{A}}_{T}\\ {a}_{T}={a}_{0}+0.5\\ {b}_{j,T}={b}_{0}+0.5[{({\mathit{\mu}}_{T})}_{j}^{2}+{({\mathit{\sum}}_{T})}_{jj}],\end{array}$$

where the subscript *T* in the updated parameters represents the fact that the parameters and hyperparameters are updated after passing a total of *T* samples. Finally, we can derive the variational lower bound of marginal log-likelihood (Appendix B):

$$\stackrel{\sim}{L}=\frac{1}{2}\left\{{\mathit{\mu}}_{T}^{\top}{\mathit{\sum}}_{T}^{-1}{\mathit{\mu}}_{T}+log\mid {\mathit{\sum}}_{T}\mid +\phantom{\rule{0.16667em}{0ex}}\sum _{t=1}^{T}\left(2log\sigma ({\xi}_{t})-{\xi}_{t}+2\phi ({\xi}_{t}){\xi}_{t}^{2}\right)\right\}+\sum _{j=0}^{d}\left\{-log\mathrm{\Gamma}({a}_{0})+{a}_{0}log{b}_{0}-{b}_{0}\frac{{a}_{T}}{{b}_{j,T}}-{a}_{T}log{b}_{j,T}+log\mathrm{\Gamma}({a}_{T})+{a}_{T}\right\}.$$

(22)

The VB inference alternatingly updates (19) and (20) to monotonically increase . The criterion for algorithmic convergence is set until the consecutive change of (22) is sufficiently small (say 10^{−4}). Upon completing the VB inference, the confidence bounds of the estimates can be derived from the posterior mean and the posterior variance [2].

It is noted that due to the assumed factorial form of posterior distribution, the variance of the estimates is relatively underestimated [2]. However, this will have little effect on the identification result. While using the ARD for variable selection, a nonsignificant coefficient is said to be pruned if its mean and variance estimates are both small (close to 0). Therefore, even if the variance is slightly underestimated, provided that the mean estimate value is is relatively large (or the solution is non-sparse), it will not change the inferred result.

Data simulations were used to compare the performance of different statistical inference procedures: (i) the standard ML method, (ii) penalized ML (PML) with _{2} regularization, (iii) PML with _{1} regularization, and (iv) hierarchical VB (HVB) method. A summary of these methods is given in Table II. Based on the performance of these methods with the simulated data, the optimal statistical inference method was also applied to real spike train data. All of the custom statistical inference algorithms were written in MATLAB^{©} (MathWorks, Natick, MA) and can be accessed online: http://neurostat.mit.edu/software. The software of the _{1} regularized logistic regression [27] was accessible from http://www.stanford.edu/~boyd/l1_logreg/.

The goodness-of-fit of the point process models estimated from all algorithms is evaluated based on the *Time-Rescaling Theorem* and *Kolmogorov-Smirnov* (KS) test [4], [5].^{3} Assuming that a univariate point process specified by *J* discrete events: 0 < *u*_{1} < ··· < *u _{J}* <

In simulation studies, in addition to the KS test, we also compute the mis-identification error rate, which is the sum of the the false positive (FP) and false negative (FN) rates. By false positive, it is meant that the true connection coefficient (only known in simulations) is zero, but its statistical estimate from the algorithm is mistakenly identified as being significantly nonzero. By false negative, it is meant that the true connection coefficient is nonzero, but the statistical estimate from the algorithm is mistakenly identified as being zero. In simulation studies, we also compute the mean-squared error (MSE) and the normalized MSE (NMSE) of the estimate, which are defined as

$$\text{MSE}=\frac{1}{C}\sum _{c=1}^{C}{\left|\right|{\mathit{\beta}}_{c}-{\widehat{\mathit{\beta}}}_{c}\left|\right|}_{2},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{NMSE}=\frac{1}{C}\sum _{c=1}^{C}\frac{{\left|\right|{\mathit{\beta}}_{c}-{\widehat{\mathit{\beta}}}_{c}\left|\right|}_{2}}{{\left|\right|{\mathit{\beta}}_{c}-{\overline{\beta}}_{c}\left|\right|}_{2}},$$

where * _{c}* denotes the estimate of the true (simulated)

Above all, we like to compare the *bias* and *variance* of the estimates computed from different statistical inference algorithms. Roughly speaking, MSE and RMSE provide two measures of the estimate’s bias. The KS statistics on the testing or validation data as well as the mis-identification rate provide essential information about the estimate’s variance. As far as functional connectivity is concerned in the present study, the FP and FN rates are particularly relevant. Ideally, a low mis-identification error rate and a lower KS statistics would be the most important criterion for choosing the best algorithm.

With the simulation data, a systematic investigation of algorithmic performance was conducted under different conditions. We considered a relatively small network that consisted of 10 simulated neurons with varying connectivity ratios (five scales: 0.1, 0.2, 0.3, 0.4, 0.5). All neurons are assumed to have roughly equal baseline firing rates (four scales: 5, 10, 15, 20 spikes/s), and the nonzero history-dependent coefficients were uniformly randomly generated between [−*h, h*] (where *h* was set to be inversely proportional to the baseline firing rate). To create a small spike train data set with short recordings, we simulated multivariate spike trains under the same condition for 8 independent trials, each trial lasting 1000 ms. The input covariates consisted of spike counts from previous temporal windows (equal size of 5 ms) up to 80 ms from all 10 neurons. Thus, there were 16 spike counts from each neuron, totaling *d* = 16 × 10 = 160 covariates (*dim*(*β** _{c}*) = 161 because of an additional baseline shift parameter). Under each condition, we simulated 20 Monte Carlo runs with varying random seeds. In each condition, we also generated an independent set of trials, which are reserved for testing (or cross-validation). The KS statistics for fitting the training (KStrain) and testing (KStest) data were both computed.

Since the simulated coefficients are random and have no smoothness structure (which may also be the case in real biological systems), we only used the standard _{2} regularization by setting ** Q** =

Comparison of inference methods on MSE and NMSE statistics. Note that the y-axis in the first two columns are in log-scale.

The following observations summarize our findings:

- Testing the algorithms began with the training data, and the KS statistics for each algorithm’s performance using the same training data under the different baseline firing rate conditions are displayed in Fig. 1. Across all algorithms, differences in KS statistic were only mildly affected by the varying connectivity ratio in this case. As expected, the KS statistics do decrease as the baseline firing rates of the virtual neurons increase (since more spikes were generated). Interestingly, when the baseline rate was at 5 or 10 Hz, the standard ML algorithm had the highest KS statistics in the training data, but when the baseline rate was increased (at 15 and 20 Hz), its performance became comparable to that of the other two penalized ML algorithms. In all cases, the HVB algorithm performed better, and the majority of the neuron fits fall within the 95% confidence bounds. The HVB’s superior KS statistics are more apparent with higher firing rate than with lower firing rates. Because all neurons have similar baseline firing rate, the KS scores from all neurons are also very close in value.
- The algorithms were then applied to the testing data, and the KS statistics were computed for testing the generalization ability on unseen data from each estimate (Fig. 2). The ML and HVB algorithms seem to be more robust regardless of the connectivity ratio, while the KS statistics of two penalized ML algorithms appeared to be affected by changes of the connectivity ratio. However, no general trend was observed. Again, the HVB algorithm exhibited the best performance in all of the tested conditions, followed by the
_{1}-PML algorithm. As a general observation, the two PML algorithms and HVB algorithm all showed noticeable advantages over the standard ML inference, but this was expected, as it was anticipated that the ML estimate would lose its desirable asymptotic properties in a “small data set” scenario. However, while the two PML algorithms clearly outperform the standard ML inference only under low baseline firing rate conditions, the HVB has showed significantly improved performances regardless of the baseline rate. The results also confirm that when the firing rate is high, and the number of ‘1s’ in the point process observations is large, the KS statistics from all (standard or regularized) ML algorithms are similar for both training and testing data sets. Summarizing Figs. 1 and and2,2, we have seen HVB has better performance than ML and PML methods in both fitting accuracy (training data) and generalization (testing data), under varying connectivity ratios and varying baseline firing rates. Specifically, the advantage of HVB is most prominent in the low connectivity ratio and/or the medium-to-high firing condition. - We next considered how the mis-identification (FP+FN) error rate changes as the connectivity ratio increases under different baseline firing rate conditions (Fig. 3). Once again, the overall best performance was produced by use of the HVB algorithm, followed by the
_{2}-PML algorithm, and finally the standard ML algorithm (the_{1}PML algorithm was excluded from the comparison here because the codes provided in [27] do not produce the confidence bound of the estimates). Under the low baseline firing rate (5 Hz), the PML and HVB had almost identical mis-identification error rates, however, as the baseline firing rate gradually increased, the gap between the PML and HVB algorithms also increased. In our simulations, it was also found that there is an inverse relationship between FP and FN error: an algorithm that had a low FP error rate typically had a relatively greater FN error. A relationship also existed between connectivity ratio and (FP+FN) error rate that occurred regardless of the baseline firing rate, as the connectivity ratio increases to 0.5, the (FP+FN) error rate reaches an asymptotic level for all tested algorithms. This implies that there is a “bottleneck” limit for identifying the true functional connectivity for all algorithms. Therefore, in the worst case, the (FP+FN) error rate can be close to a random guess (50%). The high FN error rate may be due to the fact that many of the “weak connections” in the simulations are mistakenly identified as not showing connections.^{4}In the case where a sparse solution is favored (i.e., in_{1}-PML and HVB algorithms), weak connectivity coefficients will be pushed towards zero, while only the stronger connections (which matter more) will reach significance. However, as seen in simulations, at about 10–15 Hz baseline rate and with 0.3 connectivity ratio, the mis-identification error rate is still relatively high (~30%) even for the HVB algorithm. Therefore, it remains a challenging task to identify the weak connections (especially weak negative connections where most mis-identification errors occur) in statistical inference. - Our final comparison using the simulated data involved measuring the bias associated with each algorithm’s estimate using MSE and NMSE metrics (Fig. 4). The PML and HVB approaches showed much better performance in MSE and NMSE than the standard ML method, however, the HVB algorithm again performed the best. In most of the testing conditions (except for 20 Hz baseline rate),
_{2}-PML is better than_{1}-PML. This is because_{1}regularization favors a sparse solution, which forces many small or weak connection coefficients’ estimates towards zero, possibly causing an increase in the bias of the estimate. As also expected,_{1}regularization had a smaller variance than_{2}regularization (see Fig. 2 on KStest statistics). - Overall, the HVB method achieved the best performance in all categories: KS statistics (in both training and testing spike train data sets), MSE and NMSE, as well as the mis-identification error rate. Moreover, in terms of computational speed (to achieve algorithmic convergence), it is observed that the standard ML and
_{2}-PML algorithms have the fastest convergence, while the_{1}-PML and HVB algorithms have comparable convergence rates, roughly 3 ~ 6 longer (depending on specific simulation) than the other two algorithms. In our experiments, all algorithms converged in all simulated conditions. Generally, when the connectivity ratio is higher or the spiking rate is lower, the convergence speed of_{1}-PML is slower; but the convergence speed of HVB remains very similar regardless of the connectivity ratio and spiking rate. Furthermore, when*N*-fold cross-validation was used to select the suboptimal regularization parameter in the_{1}and_{2}PML algorithms, the total*N*-run computational cost and time could be much greater when compared with the single-run HVB algorithm. Interestingly,_{2}PML achieved similar performance to HVB in the mis-identification error, especially in the low firing rate condition. However, comparing the KStrain and KStest statistics and MSE/NMSE indicates that HVB has significantly better performance. This might suggest that PML is effective in finding a “yes/no” solution, but not an accurate solution; in statistical jargon, the PML methods have a good “discrimination” but a relatively poor “calibration”. However, the discrimination ability of PML gradually decreases as the firing rate increases.

Additional issues of interest are discussed below.

The effect of the data size on algorithmic performance was also examined. This was accomplished by keeping the size of the parameter space intact, but doubling and tripling the size of the training data set, and comparing the performance of the different algorithms. To illustrate the method, we have used a medium (0.3) connectivity ratio, and computed the KS statistics (only on testing data), mis-identification error rate, and MSE for all of the tested algorithms. The results are summarized in Table III, and mixed results amongst the different algorithms are evident. For the standard ML method, increasing data size improves the KS statistic and MSE, but not necessarily the mis-identification error rate. For the penalized ML methods, increasing data size either mildly improves or does not change the MSE or KS statistic, and has a mixed effect on mis-identification error rate. For the HVB method, increasing data size improves the KS statistic but has very little effect on the MSE and mis-identification error rate. These observations suggest that the results obtained using the HVB method are very robust with regard to the data sample size, which is particularly appealing for the small data set problem.

Except for the standard ML algorithm, the other three algorithms have some additional free parameters (see the last column in Table II). The regularization parameter *ρ* needs to be selected from cross-validation. The *κ* parameter is set in a way that it is gradually increased in the interior-point method in a systematic way [27]. In HVB, the two hyperprior parameters *a*_{0} and *b*_{0} control the shape of the gamma prior (which influences the sparsity of the solution). A close examination using simulation data further revealed that the KS statistics are somewhat insensitive to the choices of *a*_{0} and *b*_{0}, although their values may change the respective FP or FN error rate. However, given a wide range of values for (*a*_{0}*, b*_{0}), the total (FP+FN) error rate remains roughly stable. This suggests that changing the hyperprior parameters of the HVB algorithm will potentially change the desirable sparsity of the solution, which will further affect the trade-off between the FP and FN error. As an illustration, Figure 5 presents the Monte Carlo averaged (across 10 independent runs) performances on the KS statistics and mis-identification error by varying different values of *a*_{0} and *b*_{0} in the HVB algorithm. In these simulations, the connectivity ratio was chosen to be 0.3 and the baseline firing rate was fixed at 10 Hz. As seen from Fig. 5, the performance of the HVB algorithm is relatively robust to a variety range of the hyperprior parameters. In this example, according to the averaged KStrain statistics (minimum 0.110), the optimal set up is (*a*_{0}*, b*_{0}) = (10^{−3}*,* 10^{−3}); according to the averaged KStest statistics (minimum 0.146), the optimal setup is (*a*_{0}*, b*_{0}) = (10^{−4}*,* 10^{−4}); according to the averaged (FP+FN) error rate (minimum 24.5%), the optimal setup is (*a*_{0}*, b*_{0}) = (10^{−2}*,* 10^{−4}). In practice, we have found that the range (*a*_{0}*, b*_{0}) [10^{−4}*,* 10^{−2}] consistently achieved good performance.

Comparative performances (mean statistics averaged over 10 Monte Carlo runs) on the KS statistics (a, b), FP error (c), FN error (d), and (FP+FN) error (e) with varying values of hyperprior parameters *a*_{0} and *b*_{0} in the HVB algorithm.

Overall, it was found in our simulations (with various setup conditions) that in the presence of small connectivity ratio, high spiking rate and a large number of spiking data, all tested algorithms produce similar KS statistics and mis-identification error. In the other conditions, the HVB algorithm always has an obviously superior margin

The best statistical inference method, the HVB algorithm, was then applied to real spike train data. This experimental data featured groups of neurons that were simultaneously recorded from the primary motor cortex of an awake, behaving cat sitting quietly in a Faraday cage. The experimental protocol for data acquisition, and the behavioral paradigm has been previously reported in detail [22]. In the current study, real neural data is used purely for demonstration purposes, and thus we used recordings from only one animal during a period of motor inactivity (i.e., baseline firing activity) so that neural firing rates were as low as possible (to purposely create sparse spiking data sets). Specific physiological details of the data used for this analysis are provided in Table IV.

Three independent recording sessions were used in this analysis. The first, second and third data sets consist of 13, 15 and 15 neurons, and 18, 18 and 17 trials, respectively, and each trial was 3000 ms in length. The M1 neurons in these data sets were classified as either regular-spiking (RS) or fast-spiking (FS) neurons based upon extracellular firing features. Many of the neurons in these data sets had very low firing rates during the trial periods (Table IV), and the short data recordings and low firing rate fit the two key criteria of the “small data problem”. In Fig. 6, we show representative spike rasters and inter-spike interval (ISI) histograms from one RS and one FS neuron. To estimate the functional connectivity amongst the ensemble neurons, we have assumed that the spiking dependence among the cells remain unchanged across different trials.

Spike rasters and inter-spike interval (ISI) histograms of two representative M1 neurons (Dataset-1): (a) a RS neuron (mean firing rate: 3.7 Hz), (b) a FS neuron (mean firing rate: 13 Hz). The legends in the ISI histograms denote the median ISI values. **...**

We binned the spike trains with 1 ms temporal resolution, and obtained multivariate point process observations. From the observed ISI histograms, we chose the spiking history up to 100 ms and selected 8 history bins (unit: ms) with the following setup:

$$[1\sim 3,4\sim 10,11\sim 20,21\sim 30,31\sim 40,41\sim 60,61\sim 80,81\sim 100]$$

The first spiking history window is chosen to capture the refractory period of the spiking property. For a total of *C* neurons, the size of parameters for fitting each neuron’s spike train recordings is *dim*(** β**) = 8

Illustrations of pairwise neuronal spiking dependence between one RS neuron (#5) and one FS neuron (#10), using the raw (top left) and corrected (top right) joint peri-stimulus time histograms (JPSTHs) and the inferred bi-directional (bottom) spiking **...**

Inferring functional connectivity of a neuronal network using simultaneously recorded spike trains is an important task in computational neuroscience, and the point process GLM provides a principled framework for statistical estimation. However, in addition to employing a favorable statistical model, the selection of an appropriate statistical inference algorithm is also a crucial undertaking. The present study aimed to solve a problem that has been present in the practice of experimental brain recording for many years: the reliability of sparse data sets for analysis [43]. Thus, this paper investigates several statistical inference procedures, while also applying these methods to the sparse spiking data. Many sets of experimental data are not appropriate for analysis of spiking dependence either because they feature neurons that are spiking at very low frequencies, or because during a difficult behavioral experiment, too few trials of the desired behavior are secured. Essentially, improving the reliability of the statistical estimates was the focus of our study, and simulated spike train data were used to compare different statistical inference algorithms. The four algorithms that were tested were the standard ML method, the _{2} and _{1} PML methods, and the HVB method. Systematic investigations were conducted to compare the performance of the algorithms (in terms of the KS statistics, MSE, and mis-identification error) under different conditions (with varying baseline firing rate and network connectivity ratio). From the Monte Carlo estimation results we conclude that (i) the HVB algorithm performed the best amongst all tested algorithms, and (ii) regularization is very important for the maximum likelihood estimation, especially in the presence of neurons with low firing rate. As an illustration, we apply the HVB algorithm to real spike train recordings from ensemble M1 neurons.

The hierarchical Bayesian method has been shown to be a powerful tool for statistical inference [18], whose applications have gone far beyond the point process GLM. The VB inference is appealing for Bayesian inference from the perspective of computational efficiency, with the goal of maximizing the lower bound of the marginal log-likelihood of observed data. The ARD principle employed in the Bayesian inference procedure provides a natural way for variable selection in that redundant or insignificant features will be shown to have smaller weights (close to zeros), thereby favoring a sparse solution. The sparsity of the solution is controlled by the hyperprior parameters, which can be set to be non-informative. Finally, the full posteriors of the parameters can be obtained, which can be used to compute a predictive distribution of unseen data [2].

The framework of the point process GLM and the HVB method provide a new way to investigate the neural interactions of ensemble neurons using simultaneously recorded spike trains. The point process GLM using the collective neuronal firing history has been shown to be effective in predicting single neuron spikes from humans and monkeys [49], which has potential applications for neural prosthetic devices. In addition, our proposed methodology can be used for assessing neural interactions. Since the point process GLM using a network likelihood function enabled us to assess spiking dependencies in populations of simultaneously recorded neurons, our approach is favorable when compared with traditional techniques (e.g., cross-correlation or JPSTH), as it may be used to examine functional connectivity as it occurs in multiple neurons simultaneously, compared with only being able to perform pairwise analysis. This is appealing for examining neural interactions at different regions of the brains, or for conducting quantitative comparison during different behaviors or task performances. The findings of our study indicates that the proposed HVB method provides a satisfactory solution to the “sparse spiking data problem” faced by many neuroscience researchers, and this method appears to outperform other contemporary statistical inference procedures in the assessment of functional connectivity in sets of spike train data where sparsity is not an issue.

Similar to other contemporary statistical approaches, our method for inferring the functional connectivity of ensemble neurons relies on certain statistical assumptions. For example, the stationarity of neuronal data during short durations of trials as well as across trials. Whilst this assumption is not always valid between trials, the non-stationarity issue across trials can be addressed by considering a random-effects GLM [11], and maximum likelihood and Bayesian inference procedures can be developed accordingly.

This work was supported by the National Institutes of Health (NIH) under Grant DP1-OD003646, Grant R01-DA015644, and Grant R01-HL084502.

Assuming that the spiking history dependent coefficients change smoothly between the neighboring temporal windows), we may impose a “local smoothness” constraint on the parameters. Heuristically, when the parameter sequences {*β _{c,k}*} are temporally smooth for any index

$${\overline{\beta}}_{c,k}=\gamma {\overline{\beta}}_{c,k-1}+(1-\gamma ){\beta}_{c,k}$$

(23)

where 0 < *γ* < 1 is a forgetting factor that determines the range of local smoothness. The role of *γ* is to act like a low-pass filter: the smaller the value of *γ*, the more emphasis is placed on the *β _{c,k}*, and a smaller smoothing effect emerges. Let us define a new quadratic penalty function from (23):

$$\sum _{c}\sum _{k}{({\beta}_{c,k}-{\overline{\beta}}_{c,k})}^{2}=\sum _{c}\sum _{k}\gamma {({\beta}_{c,k}-{\overline{\beta}}_{c,k-1})}^{2},$$

(24)

which penalizes the local variance of {*β _{c,k}*}. Let

$${\left|\right|{\mathit{\beta}}_{c}-{\overline{\mathit{\beta}}}_{c}\left|\right|}^{2}=\phantom{\rule{0.16667em}{0ex}}{\left|\right|{\mathit{\beta}}_{c}-\mathbf{S}{\mathit{\beta}}_{c}\left|\right|}^{2},$$

(25)

where a smoothing matrix **S** is introduced to represent * _{c}* in terms of

$$\mathbf{S}=\left[\begin{array}{cccccc}1-\gamma & 0& 0& 0& \cdots & 0\\ \gamma (1-\gamma )& 1-\gamma & 0& 0& \cdots & 0\\ {\gamma}^{2}(1-\gamma )& \gamma (1-\gamma )& 1-\gamma & 0& \cdots & 0\\ {\gamma}^{3}(1-\gamma )& {\gamma}^{2}(1-\gamma )& \gamma (1-\gamma )& 1-\gamma & 0& \cdots \\ 0& {\gamma}^{3}(1-\gamma )& {\gamma}^{2}(1-\gamma )& \gamma (1-\gamma )& 1-\gamma & \cdots \\ \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& \cdots & 0& \ddots & \vdots \end{array}\right]$$

Finally, we obtain the regularization matrix ** Q** =

$$\mathbf{P}=\left[\begin{array}{cccc}{\mathbf{P}}_{1}& 0& \cdots & 0\\ 0& {\mathbf{P}}_{2}& \cdots & 0\\ 0& \cdots & \ddots & \vdots \\ 0& \cdots & 0& {\mathbf{P}}_{C}\end{array}\right]=\left[\begin{array}{cccc}\mathbf{I}-\mathbf{S}& 0& \cdots & 0\\ 0& \mathbf{I}-\mathbf{S}& \cdots & 0\\ 0& \cdots & \ddots & \vdots \\ 0& \cdots & 0& \mathbf{I}-\mathbf{S}\end{array}\right].$$

(26)

where *dim*(**I** − **S**) = *K, dim*(** Q**) =

As already mentioned, we may use the contingent smoothness operator as in [46], in which
${\{{\beta}_{c,k}\}}_{k=1}^{K}$ is viewed as a curve, and the first-order derivative of the curve is approximated by *β _{c,k}* −

$${L}_{2}({\mathit{\beta}}_{c})=L({\mathit{\beta}}_{c})-\rho \sum _{k}{\left|\right|{\beta}_{c,k}-{\beta}_{c,k-1}\left|\right|}^{2},$$

(27)

and the *i*th block (*i* = 1*, …, C*) of the block-diagonal matrix **P** in (27) is derived as

$${\mathbf{P}}_{i}=\left[\begin{array}{cccccc}1& -1& 0& 0& \cdots & 0\\ 0& 1& -1& 0& \cdots & 0\\ 0& 0& \ddots & \ddots & \ddots & \vdots \\ 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& 0& 1& -1\end{array}\right],\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\mathbf{P}}_{i}^{\top}{\mathbf{P}}_{i}=\left[\begin{array}{cccccc}-1& 2& -1& 0& \cdots & 0\\ 0& -1& 2& -1& \cdots & 0\\ \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& \ddots & \ddots & \ddots & \phantom{\rule{0.16667em}{0ex}}\\ \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& -1& 2& -1\end{array}\right],$$

which both have a banded-diagonal structure.

From equations (18) and (21), we can derive the variational lower bound of the marginal log-likelihood (for notation simplicity, the dependence of ** y** in all posteriors is made implicit):

$$\stackrel{\sim}{L}={\mathbb{E}}_{q(\mathit{\beta})}[log\stackrel{\sim}{p}(\mathit{y}\mid \mathit{\beta})]+{\mathbb{E}}_{q(\mathit{\beta})q(\mathit{\alpha})}[logp(\mathit{\beta}\mid \mathit{\alpha})]-{\mathbb{E}}_{q(\mathit{\beta})}[logq(\mathit{\beta})]+{\mathbb{E}}_{q(\mathit{\alpha})}[logp(\mathit{\alpha})]-{\mathbb{E}}_{q(\mathit{\alpha})}[logq(\mathit{\alpha})].$$

(28)

where the individual terms in (28) are given by

$${\mathbb{E}}_{q(\mathit{\beta})}[log\stackrel{\sim}{p}(\mathit{y}\mid \mathit{\beta})]=\frac{1}{2}{\mathit{\mu}}_{T}^{\top}{\mathit{\sum}}_{T}^{-1}{\mathit{\mu}}_{T}-\frac{d+1}{2}+\frac{1}{2}\left({\mathit{\mu}}_{T}^{\top}{\mathit{A}}_{T}{\mathit{\mu}}_{T}+\text{tr}({\mathit{\sum}}_{T}{\mathit{A}}_{T})\right)+\frac{1}{2}\sum _{t=1}^{T}\left(2log\sigma ({\xi}_{t})-{\xi}_{t}+2\phi ({\xi}_{t}){\xi}_{t}^{2}\right)$$

(29)

$${\mathbb{E}}_{q(\mathit{\beta})a(\mathit{\alpha})}[logp(\mathit{\beta}\mid \mathit{\alpha})]=-\frac{d+1}{2}log(2\pi )+\sum _{j=0}^{d}\frac{1}{2}\left(\psi ({a}_{T})-log{b}_{j,T}\right)-\frac{1}{2}\left({\mathit{\mu}}_{T}^{\top}{\mathit{A}}_{T}{\mathit{\mu}}_{T}+\text{tr}({\mathit{\sum}}_{T}{\mathit{A}}_{T})\right)$$

(30)

$${\mathbb{E}}_{q(\mathit{\beta})}[logq(\mathit{\beta})]=-\frac{d+1}{2}\left(1+log(2\pi )\right)+\frac{1}{2}log\mid {\mathit{\sum}}_{T}\mid $$

(31)

$${\mathbb{E}}_{q(\mathit{\alpha})}[logp(\mathit{\alpha})]=\sum _{j=0}^{d}\left({a}_{0}log{b}_{0}-log\mathrm{\Gamma}({a}_{0})+({a}_{0}-1)(\psi ({a}_{T})-log{b}_{j,T})-{b}_{0}\frac{{a}_{T}}{{b}_{j,T}}\right)$$

(32)

$${\mathbb{E}}_{q(\mathit{\alpha})}[logq(\mathit{\alpha})]=\sum _{j=0}^{d}\left(({a}_{T}-1)\psi ({a}_{T})-log\mathrm{\Gamma}({a}_{T})+log{b}_{j,T}-{a}_{T}\right)$$

(33)

where Γ(·) denotes the gamma function, and *ψ*(·) denotes the digamma function, which is defined as the logarithmic derivative of the gamma function. Other notations have been defined earlier following equation (21). Summarizing equations (29) through (33) yields (22). The pseudocode of the HVB algorithm is given below.

Initialize the hyperprior parameters a_{0} and b_{0}, and set the initial parameter value to 0 (i.e. = β0). |

while convergence criteria not met do |

Evaluate the data-dependent variational parameters = {ξξ} for each data points _{t}x (_{t}t = 1, …, T). |

Update the parameter variational posterior mean μ and variance _{T}Σ._{T} |

Update the noise precision hyperparameter [].A |

Update the hyperprior parameters a and _{T}b (_{j,T}j = 0, …, d). |

Compute the variational lower bound of marginal log-likelihood (Eq. 22). |

end while |

end |

^{1}In practice, the value of *K* (or *d*) needs to be determined using a statistical procedure, such as the *Akaike information criterion* (AIC) or *Bayesian information criterion* (BIC).

^{2}The Bayesian logistic regression algorithm [19], which uses a cyclic coordinate descent algorithm for either Gaussian or Laplace prior, is de facto an empirical Bayes algorithm, since the hyperparameter’s probability distribution was not modeled and the hyperparameter was selected by heuristics.

^{3}In the case of fitting low-firing rate neural spike trains, when the lengths of inter-spike intervals are comparable to the length of the observation window, a modified KS test that considers censoring can be used [51].

^{4}The seemingly high misidentification error was partially due to the setup of our simulations. While simulating the connection weights from a uniform distribution, we counted all weak connections (even very small values) as evidence of true connectivity. In the case of high connectivity ratio, there will be proportionally more weak connections. Consequently, the tested algorithms often failed to detect the “weak connections”, thereby causing a high FN error (see Fig. 3 and Table III). As expected, if in simulations we only maintain the strong connections, then the resultant mis-identification error rate would significantly reduce.

Zhe Chen, Neuroscience Statistics Research Laboratory, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA, and also with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

David F. Putrino, Neuroscience Statistics Research Laboratory, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA, and also with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

Soumya Ghosh, Centre for Neuromuscular & Neurological Disorders, University of Western Australia, QEII Medical Centre, Nedlands, Western Australia, Australia.

Riccardo Barbieri, Neuroscience Statistics Research Laboratory, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA, and also with Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

Emery N. Brown, Neuroscience Statistics Research Laboratory, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA, and also with the Harvard-MIT Division of Health Science and Technology, and the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

1. Aertsen A, Gerstein G, Habib MK, Palm G. Dynamics of neuronal firing correlation: modulation of ‘effective connectivity’ J Neurophysiol. 1989;61:900–917. [PubMed]

2. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.

3. Brillinger DR, Villa EP. Assessing connections in networks of biological neurons. In: Brillinger DR, Fernholz LT, Morgenthaler S, editors. The Practice of Data Analysis: Essays in Honor of John W Turkey. 1997. pp. 77–92.

4. Brown EN, Barbieri R, Ventura V, Kass RE, Frank LM. The time-rescaling theorem and its application to neural spike data analysis. Neural Computat. 2002;14(2):325–346. [PubMed]

5. Brown EN, Barbieri R, Eden UT, Frank LM. Likelihood methods for neural data analysis. In: Feng J, editor. Computational Neuroscience: A Comprehensive Approach. CRC Press; 2003. pp. 253–286.

6. Brown EN, Kass RE, Mitra PP. Multiple neural spike train data analysis: state-of -the-art and future challenges. Nat Neurosci. 2004;7(5):456–461. [PubMed]

7. Brown EN. Theory of point processes for neural systems. In: Chow CC, et al., editors. Methods and Models in Neurophysics. Elsevier; 2005. pp. 691–727.

8. Chen Z, Putrino DF, Ba DE, Ghosh S, Barbieri R, Brown EN. A regularized point process generalized linear model for assessing the functional connectivity in the cat motor cortex. Proc. IEEE EMBC’09; Minneapolis, MN. 2009. pp. 5006–5009. [PMC free article] [PubMed]

9. Chen Z, Kloosterman F, Wilson MA, Brown EN. Variational Bayesian inference for point process generalized linear models in neural spike trains analysis. Proc. IEEE ICASSP’10; Dallas, TX. 2010. pp. 2086–2089.

10. Chornoboy E, Schramm L, Karr A. Maximum likelihood identication of neural point process systems. Biol Cybern. 1988;59:265–275. [PubMed]

11. Czanner G, Eden UT, Wirth S, Yanike M, Suzuki WA, Brown EN. Analysis of between-trial and within-trial neural spiking dynamics. J Neurophy. 2008;99:2672–2693. [PMC free article] [PubMed]

12. Daley DJ, Vere-Jones D. An Introduction to the Theory of Point Processes. 2. New York: Springer; 2003.

13. Efron B, Hastie T, Johnstone I, Tibshirani R. Least-angle regression. Ann Stat. 2004;32(2):407–499.

14. Eldawlatly S, Jin R, Oweiss K. Identifying functional connectivity in large scale neural ensemble recordings: A multiscale data mining approach. Neural Computat. 2009;21:450–477. [PMC free article] [PubMed]

15. Eldawlatly S, Zhou Y, Jin R, Oweiss K. On the use of dynamic Bayesian networks in reconstructing functional neuronal networks from spike train ensembles. Neural Computat. 2010;22:158–189. [PMC free article] [PubMed]

16. Fahrmeir L, Tutz G. Multivariate Statistical Modelling Based on Generalized Linear Models. 2. New York: Springer; 2001.

17. Friedman J, Hastie T, Tibshirani R. Regularized paths for generalized linear models via coordinate descent. J Statistical Software. 2010;33(1) [PMC free article] [PubMed]

18. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2. Chapman & Hall/CRC; 2004.

19. Genkin A, Lewis DD, Madigan D. Tech Rep. Rutgers Univ; 2004. Large-scale Bayesian logistic regression for text categorization. http://www.stat.rutgers.edu/~madigan/BBR/

20. Gerstein GL, Perkel DH. Simultaneous recorded trains of action potentials: analysis and functional interpretation. Science. 1969;164:828–830. [PubMed]

21. Gerwinn S, Macke JH, Seeger M, Bethge M. Bayesian inference for spiking neuron models with a sparsity prior. In: Platt JC, Koller D, Singer Y, Roweis S, editors. Adv Neural Info Proc Syst (NIPS) Vol. 20. Cambridge, MA: MIT Press; 2008. pp. 529–536.

22. Ghosh S, Putrino DF, Burro B, Ring A. Patterns of spatio-temporal correlations in the neural activity of the cat motor cortex during trained forelimb movements. Somatosensory and Motor Research. 2009;26:31–49. [PubMed]

23. Hebiri M. Regularization with the smooth-lasso procedure. 2008. http://arxiv.org/abs/0803.0668.

24. Jaakkola TS, Jordan MI. Bayesian parameter estimation via variational methods. Statist Comput. 2000;10:25–37.

25. Kass RE, Ventura V, Brown EN. Statistical issues in the analysis of neuronal data. J Neurophysiol. 2005;94:8–25. [PubMed]

26. Krishnapuram B, Carin L, Figueiredo MAT, Hartemink AJ. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans Patt Anal Mach Intell. 2005;27(6):957–968. [PubMed]

27. Koh K, Kim SJ, Boyd S. An interior-point method for large-scale _{1}-regularized logistic regression. J Machine Learning Res. 2007;8:1519–1555.

28. Komarek P. PhD thesis. Carnegie Mellon University; 2004. Logistic regression for data minding and high-dimensional classification.

29. McCullagh P, Nelder J. Generalized Linear Models. 2. London: Chapman & Hall; 1989.

30. Neal R. Bayesian Learning for Neural Networks. New York: Springer; 1996.

31. Nykamp D. A mathematical framework for inferring connectivity in probabilistic neuronal networks. Mathematical Biosciences. 2007;205:204–251. [PubMed]

32. Okatan M, Wilson MA, Brown EN. Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computat. 2005;17:1927–1961. [PubMed]

33. Park MY, Hastie T. An _{1} regularization-path algorithm for generalized linear models. J Roy Stat Soc B. 2007;69(4):659–677.

34. Pawitan Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood. New York: Oxford Univ. Press; 2001.

35. Perkel DH, Gerstein GL, Moore GP. Neuronal spike trains and stochastic point processes. II. simultaneous spike trains. Biophy J. 1967;7:419–440. [PubMed]

36. Rigat F, de Gunst M, van Pelt J. Bayesian modelling and analysis of spatio-temporal neuronal networks. Bayesian Analysis. 2006;1(4):733–764.

37. Schmidt M, Fung G, Rosaless R. Tech Rep. Dept. Computer Sciences, Univ. Wisconsin; 2009. Optimization methods for _{1}-regularization. URL: http://pages.cs.wisc.edu/~gfung/GeneralL1/

38. Schneidman E, Berry M, Segev R, Bialek W. Weak pair-wise correlations imply strongly correlated network states in a neural population. Nature. 2006;440:10007–10212. [PMC free article] [PubMed]

39. Seeger M, Gerwinn S, Bethge M. Bayesian inference for sparse generalized linear models. Proc ECML’07. 2007:298–309.

40. Shlens J, Field GD, Gauthier JL, et al. The structure of multi-neuron firing patterns in primate retina. J Neurosci. 2006;26:8254–8266. [PubMed]

41. Stevenson IH, Rebesco JM, Miller LE, Körding KP. Inferring functional connections between neurons. Curr Opin Neurobiol. 2008;18:582–588. [PMC free article] [PubMed]

42. Stevenson IH, Rebesco JM, Hatsopoulos NG, Haga Z, Miller LE, Körding KP. Bayesian inference of functional connectivity and network structure from spikes. IEEE Trans Neural Syst Rehab Engr. 2009;17:203–213. [PMC free article] [PubMed]

43. Strangman G. Detecting synchronous cell assemblies with limited data and overlapping assemblies. Neural Computat. 1997;9:51–76. [PubMed]

44. Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B. 1996;58(1):267–288.

45. Tibshirani R. The Lasso for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. [PubMed]

46. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Roy Stat Soc B. 2005;67:91–108.

47. Truccolo W, Eden UT, Fellow M, Donoghue JD, Brown EN. A point process framework for relating neural spiking activity to spiking history, neural ensemble and covariate effects. J Neurophysiol. 2005;93:1074–1089. [PubMed]

48. Truccolo W, Donoghue JD. Nonparametric modeling of neural point processes via stochastic gradient boosting regression. Neural Computation. 2007;19(3):672–705. [PubMed]

49. Truccolo W, Hochberg LR, Donoghue JP. Collective dynamics in human and monkey sensorimotor cortex: predicting single neuron spikes. Nat Neurosci. 2010;13:105–111. [PMC free article] [PubMed]

50. Utikal KJ. A new method for detecting neural interconnectivity. Biol Cybern. 1997;76:459–470.

51. Wiener MC. An adjustment to the time-rescaling method for application to short-trial spike train data. Neural Computat. 2003;15:2565–2576. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |