|Home | About | Journals | Submit | Contact Us | Français|
In this paper, we propose a new blind multichannel adaptive filtering scheme, which incorporates a partial-updating mechanism in the error gradient of the update equation. The proposed blind processing algorithm operates in the time-domain by updating only a selected portion of the adaptive filters. The algorithm steers all computational resources to filter taps having the largest magnitude gradient components on the error surface. Therefore, it requires only a small number of updates at each iteration and can substantially minimize overall computational complexity. Numerical experiments carried out in realistic blind identification scenarios indicate that the performance of the proposed algorithm is comparable to the performance of its full-update counterpart, but with the added benefit of a highly reduced computational complexity.
In recent years, blind source separation (BSS) has received much attention due to its strong potential for use in a variety of applications, such as automatic speech recognition, hearing aid devices and hands-free telephony. In scenarios where the signals of interest are mixed with other ongoing background activity and noise, BSS can be used to extract and perceptually enhance the waveform of the desired sound source(s) from a set of composite signals. By definition, BSS recovers estimates of n sources with very little to almost no prior knowledge about the source-to-sensor geometry or the source signal themselves. Instead, it relies only on information collected from a set of m convolutive data x(t) = [x1(t),…, xm(t)]T m
where H(t) represents the unknown but linear-time invariant (LTI) multiple-input multiple-output (MIMO) mixing system at discrete time t and lag . The ‘blind’ recovery of the original sources then boils down to a multichannel inverse filtering task whereby the coefficients of an L-dimensional finite impulse response (FIR) equalizer W(t) are adjusted such that the output vector u(t) = [u1(t),…, un(t)]T n
defined for all 0 ≤ ≤ L − 1. Nevertheless, since the computational complexity of any adaptive filtering algorithm is proportional to its tap length, time-domain multichannel BSS algorithms can become computationally prohibitive especially for applications that require a large number of filter taps (e.g., see , ).
To reduce excessive computational requirements, many authors have recently suggested to carry out the separation task in the frequency or the sub-band domain (e.g., see –, ). By doing so, the overall BSS task can be elegantly reduced into several independent instantaneous problems, one for each frequency subband. Still such strategies come with perils of their own, namely scaling and permutation ambiguities, which usually have a negative effect on separation performance . Another strategy for reducing the length of the inverse FIR filters is to simply increase the number of microphones present in the system. In setups where the system sensors available outnumber the active sources, the inverse FIR filters can be chosen to be significantly shorter 1 than the length of the acoustic impulse responses . Nonetheless, the design requirements in modern digital signal processors (DSPs), for example such as those used in hearing aids and cochlear implant devices , often dictate a low power consumption in battery operated equipment and also call for a portable weight in a compact size. Naturally, such design constraints prevent the use of BSS and other similar pre-processing techniques on portable devices since they (1) prohibit large amounts of processing power and (2) limit the number of microphones that can be made available to the user.
A far more efficient alternative to lower the overall computational complexity of BSS is to resort to selective-tap (or partial) updating schemes. Such schemes operate by updating only a subset of the total number of filter coefficients on every iteration and hence can lead to a substantially reduced complexity. Early examples of selective-tap techniques include the sequential and periodic least-mean-squares (LMS)  and more recently the MMax normalized least-mean-squares (MMax-NLMS) , , whereby the updated taps are the ones associated with the largest magnitude gradient components on the error surface. In this paper, we extend the MMax tap-selection criterion and derive a novel selective-updating blind scheme based on the natural gradient algorithm (S-NGA). The potential of the proposed low-complexity algorithm is verified and assessed through numerical simulations in realistic acoustical scenarios.
Spatial independence is a key assumption for BSS. In short, it implies that the output joint probability density function (PDF) pu(u(t)) is equal to the product of the output PDFs
A simple way to check for independence is to measure the distance between the two sides of (3) with an appropriate distance measure such as the Kullback-Leibler (K-L) divergence. Alternatively, we can resort to the maximum likelihood (ML) principle and choose to optimize the negative log-likelihood (objective) function with respect to the unmixing matrix W such that
Due to the Riemannian nature of the optimization parameter space, a less computationally burdensome choice to minimize the cost function in (4) with respect to the unmixing matrix W is the so-called natural gradient , which is an optimal re-scaling of the standard (stochastic) entropy gradient (e.g., see ). For multipath conditions where L > 1, the natural gradient algorithm (NGA) is given by –, 
where 0 < μ < 1 is a positive learning parameter controlling the rate of convergence and rate of adaptation, (·)H is the Hermitian operator and symbol ϕ(·) represents the nonlinear monotonic activation (or score) function operating elementwise on the output signal vector, such that2
with pui(ui) denoting the PDF of each source estimate ui. Note also that vector ũ(t) represents the reverse-filtered output computed by using the latest (L − 1)-samples backwards from the current sample t for all lags = 0, 1,, L − 1 as shown in (7).
When approaching convergence, ΔW(t) 0, assuming a sufficiently small step-size. The stationary points of (6) can guarantee both temporal and spatial statistical independence under the following two conditions
where E[·] is the statistical expectation, (·)* represents the complex-conjugate operator and δ is the Kronecker delta, which is equal to 1 for = 0 and equal to 0 otherwise. As it can be seen from (10), the convergence behavior of the NGA depends solely upon the magnitude of the so-called estimating function at each iteration, which is equal to
for lags = 0, 1,, L − 1. Since the above estimating function is not equally sensitive to variations from all the updated filter coefficients, a tap-selection criterion can be constructed by employing only M out of L coefficients with the largest values of |R(t)| for all lags = 0, 1,…, L − 1, at each iteration. The subset of the filter coefficients to be partially updated at any particular time t is specified in the (nL × mL) matrix Q(t), which is coined the tap selection matrix, and is given by
where each element of the tap-selection matrix is given by
such that, after dropping the time index t for convenience,
where |·| denotes absolute value and r(τ) are the elements of (11) at the (sorted) indices τ = 0, 1,…, L − 1. Every element qij(τ) is either equal to one or zero, depending on whether the condition in (14) is satisfied or not. In order to calculate the filter coefficients that are to be updated at different time instants, a fast sorting algorithm needs to be executed at every iteration . After the sorting is performed, each block of the tap selection matrix Q(t) contains M < L coefficients equal to one in the positions (or indices) calculated from (14) and zeros elsewhere, such that
where operator trace[·] denotes the sum of the diagonal elements of the input matrix argument. Accordingly, to update only M taps in the equalizer W(t), we can write the selective-natural gradient algorithm (S-NGA), as follows
where parameter λ represents the new learning rate. Note that for M = L, the S-NGA algorithm in (16)–(19) reduces to the full-update NGA algorithm given in (5)–(7). In general, the separation performance of the S-NGA depends on the degree of coefficient reduction achieved and therefore on the overall number of the sorted filter taps in (t).
To assess the performance of the S-NGA, five male speakers are corrupted by interfering speech. The signals are chosen from the IEEE speech corpus, which consists of phonetically balanced sentences, with each sentence being composed of approximately 7 to 12 words (e.g., see ). Every sentence produced by a male talker is designated as the target speech. To simulate the speech interferer or competing voice in this experiment, a female talker uttering the sentence “Tea served from the brown jag is tasty” is chosen as the interferer (or masker) source. The source signals are approximately 4 s in duration and are recorded at a sampling rate of 8 kHz. A set of five convolutive speech mixtures are produced by convolving the clean signals with the two binaural room impulse responses (BRIRs) depicted in Fig. 1 (a) and (b) (e.g., see ). The length of the acoustic impulse responses is 2,048 sample points, corresponding to a delay of around 256 ms at a sampling rate of 8 kHz.
In order to simplify the estimation problem, the channel distortions from the target source to microphone 1 and from the female masker to microphone 2 are assumed negligible. The BRIRs are measured in a 5 × 9 × 3.5 m ordinary classroom using a KEMAR positioned at 1.5 m above the floor and at ear level . The broadband reverberation time3 of this enclosure is equal to TR = 200 ms, which is a typical value for a moderately reverberant environment. Both speech signals have the same onset and are normalized so that their maximum amplitude is unity. By convolving the speech signals with these pre-measured impulse responses, the target male is positioned directly at the front of the listener at a 0° azimuth, whereas the female interferer is placed at an angle of 60° to the right.
The S-NGA is executed with L = 2,048, whereas M is set to 2,048, 1,024, 512 and 256 taps. The learning rates are explicitly tuned to yield the maximum possible steady-state performance. In all cases, the equalizer is initialized using a center-tap scheme, such that W(0) = δ − M/2 I. The algorithm operates with the hyperbolic tangent score function ϕi(ui) = tanh(ui) and converges after approximately 20 passes through the convolutive speech data. To assess separation performance we resort to the signal-to-interference-ratio improvement (SIRI). SIRI is defined as the overall amount of crosstalk reduction achieved by the algorithm before (SIRi) and after (SIRo) the unmixing stage and is described in .
We test the S-NGA on a set of convolutive speech mixtures generated using the pre-measured impulse responses in Fig. 1. Table I contrasts performance and reduction in complexity relative to the complexity of the full-update algorithm. As expected, the full-update NGA yields the best SIRI performance. In addition, even when the S-NGA operates with a reduced number of filter coefficients at every iteration, it exhibits only a slightly lower separation performance. SIRI values indicate that the overall degree of separation remains unchanged for M = 1,024. In fact, even when setting M = 512, which accounts for a 75% reduction in the total equalizer length (with a processing delay of just 64 ms at 8 kHz) and around 40% reduction in computational complexity, the algorithm manages to retain almost 80% of its full-update counterpart performance. The overall computational complexity of the S-NGA in terms of floating point operations per second (FLOPS) is around 50% less when only 3/16 of filter coefficients (amounting to a 81.25% reduction) are adapted at every iteration. In all cases, to ensure that the complexity due to coefficient selection is kept low, we use a fast sorting routine (e.g., see SORTLINE ), which only requires an additional log2L + 2 tap comparison operations per sample.
We have developed a selective-updating BSS scheme (S-NGA), which can learn multiple filters at a substantially reduced computational overhead, while retaining a satisfactory separation performance. Experiments in reverberant settings, show that the overall tradeoff between filter length reduction and performance loss is acceptable. The S-NGA has great potential for use in portable devices, e.g., hearing aids and cochlear implants, since it can operate with considerably shorter filters and still equalize long acoustic echo paths with sufficient accuracy.
This work was supported by Grants R03 DC 008882 and R01 DC 007527 awarded from the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH).
1For example in a two-source and three-sensor configuration convolved with 1,024 sample point acoustic impulse responses, the lower bound for the FIR filter length L yielding a perfect system inverse can decrease to around 512 taps, if we assume that the number of microphones doubles.
2Instead of using the definition in (8), we could re-write the update in (6) as ΔW(t) = μ [W(t) − (u(t − L + 1)) u˜H (t − )] whereby we are introducing the (L − 1)-sample delay in term u(t) to accommodate for non-causal parts of the equalizer filters .
3The parameter TR defines the interval in which the reverberating sound energy, due to decaying reflections, reaches one millionth of its initial value. In other words, it is the time it takes for the reverberation level to drop by 60 dB below the original sound energy present in the room at a given instant.