Data were collected with a Neuralynx Cheetah system connected to four tetrodes, each independently processed. The algorithm presented here was developed using recordings made in the inferior colliculus of awake rabbit, employing general methods described previously for single-unit recordings (

Nelson and Carney, 2007) except that tetrodes were used. During recording sessions, animals were seated in a plastic chair inside a sound-attenuated booth. The head was fixed to allow delivery of sound stimuli to each ear via custom earmolds. All methods were approved by the University of Rochester Committee on Animal Research and complied with NIH guidelines.

The algorithm is also illustrated using a data set recorded from a different midbrain region, the nucleus of the brachium of the inferior colliculus of awake marmoset, provided by S. Slee at Johns Hopkins University. These recordings were made with Thomas Recording tetrodes in a fixed-head preparation to allow controlled delivery of acoustic stimuli. Because all of the data of interest were in fixed-head recordings, motion artifacts associated with freely moving animals were not a concern here. Although awake animals in a head-fixed preparation occasionally shrug or fidget, the artifacts associated with these movements can typically be eliminated as outlier waveforms.

Positive- or negative-going voltage transitions of sufficient amplitude on any one of the four tetrode wires triggered events that were “snapshots” of the waveforms recorded by those wires, each consisting of 32 samples at a rate of 32,051 samples/sec. The processing described here is equally valid for spikes of opposite polarity simply by inverting the detection pattern. The waveforms were collected in a file for later analysis.

The events from a single two-hour recording session can number up to several hundred thousand. One goal was to make the evaluation of an experiment take no more than a few minutes so that adjustments to the experimental procedure, such as the position of the tetrode, could be made in “real time.” To achieve the desired speed, rather than clustering all the events, a subset is selected and used to train the clustering algorithm by collecting information that characterizes the clusters. Occasionally, it was noticed that the characteristic spike shape of a neuron changed slightly and gradually during the experiment, but not so much that the spikes become confused with those of another neuron. Consequently, in order to capture the full range of shapes, the training set must include events distributed across the whole experiment. It is also necessary to use events that are contiguous so that interval histograms can be computed. To satisfy both requirements, the full set of events is broken into blocks of contiguous events, and the blocks are uniformly distributed throughout the experiment. If the training set has

*M* events, it is appropriate to form approximately

blocks of approximately

events each. The exact numbers are not important; in practice

*M* might be chosen to be 10,000–20,000.

From the features computed for the training set, the number of clusters, *k*, is estimated visually, and the statistical properties (mean and covariance matrix) of each cluster are computed. The cluster means and covariance matrices are used to classify the full set of events.

To summarize, the processing procedure is as follows: 1) record events, 2) train clustering algorithm: 2a) choose subset of events, 2b) compute feature vectors, 2c) choose *k* by visual inspection of feature vector plots, 2d) cluster, 2e) repeat 2c–2d as necessary to achieve an acceptable result, 3) compute feature vectors for all events, and 4) classify all events into *k* clusters. Each of these steps is described in more detail below.

2.1. Feature computation with repolarization slope (RPS)

Action potentials from different neurons tend to have different shapes, but all have the general characteristics of the spike shown schematically in .

As previously stated, the proposed waveform feature is computed by taking the maximum value of the cross-correlation of the waveform with a fixed pattern,

*p*, defined in

Eq. 1. Samples of a spike waveform

*s*(

*t*), triggered at

*t* = 0, are given by

where

*T* is the sampling interval and

*N*_{s} is the number of samples recorded per spike. For a pattern given by

*p* [

*n*], the feature,

*r*, is computed by

Mathematically, cross-correlation is the same as convolution when the pattern has been time-reversed, so the feature can also be computed as

where

*q* [

*n*] =

*p* [−

*n*].

Representing the cross-correlation as a convolution is convenient because the operation can be interpreted as passing the spike waveform through a digital filter. Indeed, a filter whose impulse response is equal to the time-reverse of a pattern to be matched is called a

*matched filter* (

Turin, 1960) because the impulse response is matched to the signal being detected. It is interesting to note that the filter impulse response,

*q*, is exactly the convolution of two other sequences,

where

*K*_{d} performs a derivative-like operation and

*K*_{f} is a low-pass filter with the frequency response shown in for

*n* = 4. Therefore, convolution with

*q* produces an approximation to the filtered derivative of each of the four waveforms recorded by the tetrode, and each computed feature is approximately proportional to the magnitude of the slope of the spike in the region in which it is falling most rapidly, normally the repolarization region (). It is from this description that the algorithm receives its name,

*repolarization slope* (RPS).

Due to noise, the exact phase at which triggering occurs varies, resulting in a random variation of the spike location along the time axis. Some feature-extraction algorithms require spike alignment to remove this temporal variation, but alignment is not necessary for this algorithm because each feature is computed from the maximum value of the whole cross-correlation function rather than from a specific sample.

The feature vector for an event is simply the concatenation of the four *r* values, one from each waveform recorded by the tetrode. These values are further concatenated into an *M*×4 feature matrix, which is the set of *M* feature vectors for the *M* events.

2.2. Feature-vector visualization and selection of number of clusters

The next step in the spike sorting process is to visualize the feature vectors so that the number of clusters,

*k*, can be determined. Unfortunately, each feature vector is a four-dimensional entity that is hard to visualize and impossible to plot. One approach to this problem is to plot the two dimensions with the greatest variance, as obtained from PCA (

Glaser and Marks, 1968;

Abeles and Goldstein Jr, 1977). As stated previously, use of PCA can be unsatisfactory. Instead, an array of scatter plots is displayed, similar to those shown in , each of which plots two different columns of the feature matrix against each other. Visual inspection of the scatter plots has usually been found to be sufficient to determine the number of clusters,

*k* (

*k* = 3 in the example shown). However, because these plots only show two dimensions at a time, it is possible for clusters to remain hidden. That has not proved to be a problem in practice, and it is easy to try different values of

*k* in any case.

2.3. Cluster initialization

The standard

*k*-means algorithm (

Lloyd, 1982) chooses the initial cluster centers to be the coordinates of

*k* points chosen at random from the data before proceeding with the iterative clustering. This strategy can produce cluster centers that are close together resulting in clusters that are not properly defined, so it is standard practice to run

*k*-means several times until a good result has been obtained. The

*k*-means++ algorithm (

Arthur and Vassilvitskii, 2007) was devised to reduce or eliminate these multiple runs by making a better choice for the set of initial cluster centers. (Despite its name,

*k*-means++ is not a clustering algorithm—it only performs the cluster initialization step.) The algorithm consists of the following steps: a) choose one point at random as the first cluster center, b) for each point to be clustered,

*x*, compute the Euclidean distance,

*D*_{e}(

*x*), to the nearest existing cluster center, c) add one point at random as a new cluster center using a weighted probability where point

*x* is chosen with probability proportional to

, and d) repeat b and c until

*k* centers have been chosen. In practice, this technique usually selects cluster centers that are well separated.

2.4. Iterative cluster determination and Mahalanobis distance

The iterative clustering of any variant of *k*-means consists of just two steps: 1) assign each point to the nearest cluster, and 2) recompute the cluster descriptive statistics (e.g., mean). These steps are repeated until the assignments do not change from one iteration to the next.

In order to assign each point to the nearest cluster, it is necessary to define some notion of distance. Ordinary Euclidean distance does not work well when the clusters are elongated and close to each other, as they often are in this application.

Mahalanobis (1936) distance is a statistical metric used to determine how well a point,

*x*, fits in a whole distribution of points,

*C*, taking into account the shape, size and orientation of the distribution, and is defined by

where

*C* is described by its mean, μ

_{C}, and covariance matrix, Σ

_{C}. The distance is unitless because it is a relative measure. An example is illustrated in .

To use Mahalanobis distance in the determination of the nearest cluster, Σ_{C} must be nonsingular. Immediately after initialization, each cluster contains only one point, and all covariance matrices are identically zero and singular. Intuitively, the inability to compute Mahalanobis distance from a cluster of one point makes sense, because it is impossible to infer anything about the size or shape of such a cluster. Consequently, whenever a cluster has only one point, as will always occur in the first iteration, some method other than Mahalanobis distance must be used to compute point-cluster distances. Euclidean distance was chosen as the alternate method, because it does not require a covariance matrix.

To illustrate the difference between the Euclidean and Mahalanobis distance measures, shows equal distance contours from two clusters.

2.5. Scaling by cluster size, k-means with Scaled Mahalanobis Distance (KSMD)

With the strategy presented so far, the clustering algorithm often failed to find small clusters located close to large ones, included too many points from a small cluster in a nearby large cluster, produced cluster regions with holes (one cluster completely surrounding another cluster), or produced disjoint cluster regions. It was hypothesized that these shortcomings could be eliminated or reduced if the Mahalanobis distance measures were scaled by a factor related to the size of the cluster (

Quian Quiroga et al., 2004;

Harris et al., 2000;

Shoham et al., 2003). The idea is that the scaled distance of a point from a large cluster should have a larger numerical value than a point with the same Mahalanobis distance from a small cluster, encouraging the point to belong to the smaller cluster.

Intuitively, the scale factor should be based on a linear measure of cluster size rather than volume, which was confirmed empirically. Consequently, the cluster size is expressed as the length of a side of the

*N*-dimensional hypercube with the same volume as the cluster, i.e., the

*N*^{th} root of the volume. Strictly speaking, a cluster is a distribution, not a polytope, and does not have a volume, but for this application the volume,

*V*, can be defined as the product of the widths of the cluster along the

*N* orthogonal axes of the cluster,

where λ

_{i} is the

*i*^{th} eigenvalue of the covariance matrix of the cluster members. These λ

_{i} are exactly equal to the variances of the distribution in the axial directions if the cluster were lined up with the coordinate axes via PCA. As mentioned, the side length,

, is obtained simply by taking the

*N*^{th} root of the volume,

Finally, the scale factor for cluster

*C*_{j} was chosen to be

where the exponential parameter, α, has been introduced to control the scaling effect. The nominal value of α = 1 generally succeeds in minimizing the shortcomings present without the scaling. Setting α to a value much above 1 can introduce unwanted artifacts (). Values of α between 0 and 1 were not tried, but might prove useful. Setting α = 0 is equivalent to having no scaling.

2.6. Cluster evaluation

An important part of clustering is evaluation of the resulting clusters (

Hill et al., 2011;

Joshua et al., 2007). Whether because the wrong number of clusters was chosen or because the random component in the initialization step caused the clustering algorithm to perform poorly, sometimes it is necessary to recluster. There are several ways to perform this evaluation: visual inspection of the feature vector scatter plot, computation of a cluster separation metric, inspection of waveform histograms, and examination of interval histograms. Each approach will be described below and illustrated in the Results section.

2.6.1. Visual evaluation of feature vectors The visual evaluation involves making a paired scatter plot in which the points from each cluster are plotted in a different color. A simple visual inspection by the user can indicate whether the clustering was successful. Poor clustering is often obvious, as in . Sometimes, even when the number of clusters is easy to estimate, the clustering algorithm is unable to cluster the points as desired, a failure that is easily determined by visual observation of the scatter plots.

2.6.2. Cluster separation metric A quantitative cluster separation metric called

*L*_{ratio} (

Schmitzer-Torbert et al., 2005) is defined by

where

is the cumulative distribution function of the χ

^{2} distribution with

*N* degrees of freedom (

*N* is the dimensionality of the feature space),

*D*_{C}(

*x*_{i}) is the Mahalanobis distance of point

*x*_{i} from cluster

*C*, and

*n*_{C} is the number of points in

*C*.

*L* represents a measure of the isolation of

*C* so that a cluster with a significant gap surrounding it will have a small value.

*L*_{ratio} normalizes

*L*(

*C*) by the number of points in

*C* so that if an acceptability criterion is established, a dense cluster will be able to tolerate more contamination than a sparse one. The values are mostly useful for differentiating between multiple clusterings of the same data and not intended to be used to rate different data sets. Because each

*L*_{ratio} ≥ 0 and small values indicate good cluster separation, the sum,

will only be small if each

*L*_{ratio} is small and thus serves as an overall figure of merit for the clustering.

2.6.3. Interval histograms After a neuron has fired it cannot fire again during its refractory period (

Hodgkin and Huxley, 1952), which typically has a duration of about 1ms for neurons that have relatively high discharge rates (auditory nerve:

Gray, 1967; inferior colliculus:

Yagodnitsyn and Shik, 1974). This characteristic can be used to test if the spikes in a given cluster could actually have been produced by a single neuron. The distribution of first-order spike interval times (the length of time between a spike and the next spike from the same neuron) is examined to see what fraction of spike intervals are less than the refractory period. If the fraction is zero, then it is possible that all the spikes in that cluster were produced by a single neuron. In practice, some cluster contamination is acceptable (spikes incorrectly assigned to that cluster), perhaps 0.5%.

Conversely, it is expected that many intervals from a spike in one cluster to the subsequent spike *in another cluster* will be less than the refractory period. Therefore, examination of cross-cluster spike intervals can indicate whether spikes from a single neuron have been assigned to more than one cluster, as happens in over-clustering.

2.6.4. Waveform histograms After clustering, a natural way to evaluate whether all the spikes in a cluster have similar shapes is to plot them on a single graph. A single graph works well if there are few enough waveforms that most of them remain visible despite the clutter, but it does not accurately convey the waveform distribution when many waveforms are obscured. Instead, the waveform histogram has been defined that is simply a sequence of histograms of the waveform amplitudes at each time sample for a given cluster. To display the waveform histogram, an image is constructed by displaying each bin of each histogram as a single pixel—the number of pixels in the vertical direction is the number of bins and the number of pixels in the horizontal direction is the number of samples per waveform. The gray level of each pixel is proportional to the corresponding histogram value, leading to a fuzzy-looking spike waveform. The degree of fuzziness indicates the width of the distribution: a well-defined appearance implies a narrow distribution and a fuzzier appearance implies a broad distribution. See results for examples (). See also

Hill et al. (2011).

Previously, it was stated that one of the advantages of RPS was that it does not require the spikes to be aligned. That is true, but the waveform histogram does require spike alignment for maximum utility. Spike alignment, using an approach that aligns the peak values of the waveforms after passing them through a low-pass filter, was included in the waveform histograms shown here.

2.6.5. Cluster temporal stability A final cluster evaluation is to examine the temporal stability of the clusters. In some applications, the signal from a neuron can change during a measurement session, perhaps due to a slight shift in position of the tetrode, resulting in apparently two (or more) clusters from a single neuron. To ensure that a single neuron has not been manifested as multiple clusters, it is sufficient to examine histograms of spike times for each cluster. If the times of the spikes in each cluster are distributed throughout the session it is likely that the clusters are stable.

2.7. Classification of all points

If the feature vectors used for training the final clustering are representative of the whole set of feature vectors, then it is reasonable to assume that the cluster mean and covariance matrices obtained from training are accurate estimates of those that would be obtained from the whole data set. The mean and covariance matrix of each cluster of the training set can be used to cluster the full set in an efficient way. All that is required is to compute the scaled Mahalanobis distance of each feature vector from each training cluster and assign each feature vector to the nearest cluster.

2.8. Implementation and clustering procedure

The algorithms described above have been implemented in MATLAB (The MathWorks, Inc., Natick, Massachusetts), in a convenient application with a graphical user interface.^{1} It is important to note that RPS and KSMD are merely two of the tools that might be employed and a full-featured application supplies other tools to try if RPS or KSMD were not effective on a particular data set.

Generally, the clustering procedure is as follows: 1) load the waveforms for the training set of events, 2) compute the feature vectors using the default method (e.g., RPS) and display the paired scatter plots, 3) estimate the number of clusters, compute them with the default clustering algorithm (e.g., KSMD with α = 1), and update the scatter plot display with different colors for each cluster, 4) check the results by examining the paired scatter plots, waveform histograms, interval histograms, and *L*_{Σ}, 5) if the results are not acceptable, try a different number of clusters, feature computation, and/or clustering algorithm until acceptable results are obtained, and 6) using the selected algorithms and cluster statistics gathered from the training set, cluster the whole data set.