


Conf Proc IEEE Eng Med Biol Soc. Author manuscript; available in PMC 2010 April 19.


PMCID: PMC2856458

NIHMSID: NIHMS192076

Center for Robust Speech Systems, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75080, USA

Kostas Kokkinakis: kokkinak@utdallas.edu; Philipos C. Loizou: loizou@utdallas.edu


**Abstract**

In this paper, we propose a new blind multichannel adaptive filtering scheme, which incorporates a partial-updating mechanism in the error gradient of the update equation. The proposed blind processing algorithm operates in the time domain by updating only a selected portion of the adaptive filters. The algorithm steers all computational resources to the filter taps having the largest magnitude gradient components on the error surface. It therefore requires only a small number of updates at each iteration and substantially reduces the overall computational complexity. Numerical experiments carried out in realistic blind identification scenarios indicate that the performance of the proposed algorithm is comparable to that of its full-update counterpart, but with the added benefit of a greatly reduced computational complexity.

**I. Introduction**

In recent years, *blind source separation* (BSS) has received much attention due to its strong potential for use in a variety of applications, such as automatic speech recognition, hearing aid devices and hands-free telephony. In scenarios where the signals of interest are mixed with other ongoing background activity and noise, BSS can be used to extract and perceptually enhance the waveform of the desired sound source(s) from a set of composite signals. By definition, BSS recovers estimates of *n* sources with little to almost *no prior* knowledge about the source-to-sensor geometry or the source signals themselves. Instead, it relies only on information collected from a set of *m* convolutive observations **x**(*t*), generated as

$$\mathbf{x}(t)=\sum_{\ell=0}^{\infty}\mathbf{H}_{\ell}(t)\,\mathbf{s}(t-\ell),\quad t=1,2,\dots$$

(1)

where **H**_{ℓ}(*t*) represents the unknown but linear time-invariant (LTI) multiple-input multiple-output (MIMO) mixing system at discrete time *t* and lag ℓ. The 'blind' recovery of the original sources then boils down to a multichannel inverse filtering task, whereby the coefficients of an *L*-dimensional finite impulse response (FIR) equalizer **W**_{ℓ}(*t*) are adjusted such that the output vector **u**(*t*) is computed as

$$\mathbf{u}(t)=\sum_{\ell=0}^{L-1}\mathbf{W}_{\ell}(t)\,\mathbf{x}(t-\ell),\quad t=1,2,\dots$$

(2)

with the equalizer taps defined for all 0 ≤ ℓ ≤ *L* − 1. Nevertheless, since the computational complexity of any adaptive filtering algorithm is proportional to its tap length, time-domain multichannel BSS algorithms can become computationally prohibitive, especially for applications that require a large number of filter taps (e.g., see [1], [2]).
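To make the signal model concrete, the mixing and unmixing operations in (1) and (2) amount to two nested matrix-valued FIR convolutions. The sketch below uses arbitrary toy dimensions (not the values used later in the paper) and an identity equalizer that simply passes the mixtures through:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 2            # number of sources and sensors (toy values)
T, K, L = 1000, 8, 16  # samples, mixing-filter taps, equalizer taps

s = rng.laplace(size=(n, T))              # super-Gaussian (speech-like) sources
H = 0.3 * rng.standard_normal((K, m, n))  # random MIMO mixing filters H_l
H[0] = np.eye(m, n)                       # dominant direct path at lag 0

# Eq. (1): x(t) = sum_l H_l s(t - l)
x = np.zeros((m, T))
for l in range(K):
    x[:, l:] += H[l] @ s[:, :T - l]

# Eq. (2): u(t) = sum_l W_l x(t - l); an identity W at lag 0 passes x through
W = np.zeros((L, n, m))
W[0] = np.eye(n, m)
u = np.zeros((n, T))
for l in range(L):
    u[:, l:] += W[l] @ x[:, :T - l]
```

With the identity initialization, the equalizer output equals the mixtures; BSS consists of adapting the taps **W**_{ℓ} away from this starting point.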

To reduce these excessive computational requirements, many authors have recently suggested carrying out the separation task in the frequency or sub-band domain (e.g., see [3]–[5]). By doing so, the overall BSS task can be elegantly reduced to several independent instantaneous problems, one for each frequency sub-band. Still, such strategies come with perils of their own, namely *scaling* and *permutation* ambiguities, which usually have a negative effect on separation performance [6]. Another strategy for reducing the length of the inverse FIR filters is simply to increase the number of microphones present in the system. In setups where the available sensors outnumber the active sources, the inverse FIR filters can be chosen to be significantly shorter^{1} than the acoustic impulse responses [7]. Nonetheless, the design requirements of modern digital signal processors (DSPs), such as those used in hearing aids and cochlear implant devices [8], often dictate low power consumption in battery-operated equipment and call for a portable weight in a compact size. Naturally, such design constraints hinder the use of BSS and other similar pre-processing techniques on portable devices, since they (1) restrict the amount of available processing power and (2) limit the number of microphones that can be made available to the user.

A far more efficient alternative to lower the overall computational complexity of BSS is to resort to *selective-tap* (or *partial*) updating schemes. Such schemes operate by updating only a *subset* of the total number of filter coefficients on every iteration and hence can lead to a substantially reduced complexity. Early examples of selective-tap techniques include the sequential and periodic least-mean-squares (LMS) [9] and more recently the MMax normalized least-mean-squares (MMax-NLMS) [10], [11], whereby the updated taps are the ones associated with the largest magnitude gradient components on the error surface. In this paper, we extend the MMax tap-selection criterion and derive a *novel* selective-updating blind scheme based on the natural gradient algorithm (S-NGA). The potential of the proposed low-complexity algorithm is verified and assessed through numerical simulations in realistic acoustical scenarios.
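To illustrate the selective-tap idea in its simplest setting, here is a minimal single-channel MMax-NLMS sketch in the spirit of [10], [11] (an illustration, not code from the paper): at each sample, only the *M* taps whose gradient components, proportional to the regressor entries, are largest in magnitude are updated.

```python
import numpy as np

def mmax_nlms(x, d, L=16, M=4, mu=0.5, eps=1e-8):
    """MMax-NLMS sketch: per sample, update only the M taps aligned with
    the largest-magnitude regressor entries (and hence the largest
    gradient components on the error surface)."""
    w = np.zeros(L)
    for t in range(L - 1, len(x)):
        xt = x[t - L + 1:t + 1][::-1]              # regressor [x(t), ..., x(t-L+1)]
        e = d[t] - w @ xt                          # a priori error
        idx = np.argpartition(np.abs(xt), -M)[-M:] # MMax tap selection
        w[idx] += mu * e * xt[idx] / (xt @ xt + eps)
    return w

# Identify an unknown FIR system from its input/output (toy setup).
rng = np.random.default_rng(1)
h = rng.standard_normal(16) * np.exp(-0.3 * np.arange(16))  # hypothetical system
x = rng.standard_normal(20000)
d = np.convolve(x, h)[:len(x)]
w = mmax_nlms(x, d)        # converges close to h despite updating 4/16 taps
```

Even though only a quarter of the taps move at each iteration, every tap is selected often enough (for a white input) that the filter still converges, which is the behavior the S-NGA exploits in the multichannel blind setting.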

**II. Selective-Tap Blind Source Separation**

Spatial independence is a key assumption for BSS. In short, it implies that the joint probability density function (PDF) of the outputs factorizes into the product of the marginal PDFs, namely

$$p_{\mathbf{u}}(\mathbf{u}(t))=\prod_{i=1}^{n}p_{u_{i}}(u_{i}(t)).$$

(3)

A simple way to check for independence is to measure the distance between the two sides of (3) with an appropriate distance measure, such as the Kullback–Leibler (K-L) divergence. Alternatively, we can resort to the *maximum likelihood* (ML) principle and choose to optimize the negative log-likelihood (objective) function with respect to the unmixing matrix **W**_{ℓ}, such that

$$\mathcal{J}(\mathbf{W}_{\ell})=-\sum_{i=1}^{n}E\left[\log p_{u_{i}}(u_{i})\right]-\log\left|\det(\mathbf{W}_{\ell})\right|$$

(4)

Due to the Riemannian nature of the optimization parameter space, a less computationally burdensome choice for minimizing the cost function in (4) with respect to the unmixing matrix **W**_{ℓ} is the so-called *natural gradient* [12], which is an optimal re-scaling of the standard (stochastic) entropy gradient (e.g., see [13]). For multipath conditions where *L* > 1, the *natural gradient algorithm* (NGA) is given by [1]–[2], [12]

$$\mathbf{W}_{\ell}(t+1)=\mathbf{W}_{\ell}(t)+\Delta\mathbf{W}_{\ell}(t)$$

(5)

$$\Delta\mathbf{W}_{\ell}(t)=\mu\left[\mathbf{W}_{\ell}(t)-\boldsymbol{\varphi}(\mathbf{u}(t))\,\tilde{\mathbf{u}}^{H}(t-\ell)\right]$$

(6)

$$\tilde{\mathbf{u}}(t)=\sum_{k=0}^{L-1}\mathbf{W}_{L-1-k}^{H}(t)\,\mathbf{u}(t-k)$$

(7)

where 0 < *μ* < 1 is a positive learning parameter controlling the rate of convergence and adaptation, (·)^{*H*} is the Hermitian (conjugate-transpose) operator and the symbol **φ**(·) denotes the stacked vector of score functions, defined as

$$\boldsymbol{\varphi}(\mathbf{u}(t))\triangleq\left[\varphi_{1}(u_{1}(t)),\dots,\varphi_{1}(u_{1}(t-L+1)),\dots,\varphi_{n}(u_{n}(t)),\dots,\varphi_{n}(u_{n}(t-L+1))\right]^{T}$$

(8)

where

$$\varphi_{i}(u_{i})=-\frac{\partial\log p_{u_{i}}(u_{i})}{\partial u_{i}},\quad i=1,2,\dots,n.$$

(9)

with *p _{ui}*(*u _{i}*) denoting the marginal PDF of the *i*-th output signal.
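The update (5)–(7) can be sketched as a batch estimate over a block of separator outputs. The sketch below is illustrative only (it is not the authors' implementation): it uses the per-channel tanh score, which by (9) corresponds to an assumed super-Gaussian, speech-like source PDF, and replaces the expectation in (6) with a time average.

```python
import numpy as np

def nga_sweep(W, u, mu=1e-3):
    """One NGA sweep, Eqs. (5)-(7), over a block of separator outputs.
    W: (L, n, n) equalizer taps; u: (n, T) block of outputs."""
    L, n, _ = W.shape
    T = u.shape[1]
    # Eq. (7): filtered outputs u~(t) = sum_k W_{L-1-k}^H u(t-k)
    u_tilde = np.zeros_like(u)
    for k in range(L):
        u_tilde[:, k:] += W[L - 1 - k].conj().T @ u[:, :T - k]
    phi = np.tanh(u)  # score function of Eq. (9) for a speech-like PDF
    W_new = np.empty_like(W)
    for l in range(L):
        # Eq. (6): E[phi(u(t)) u~^H(t - l)] estimated by a time average
        corr = phi[:, l:] @ u_tilde[:, :T - l].conj().T / (T - l)
        W_new[l] = W[l] + mu * (W[l] - corr)  # Eqs. (5)-(6)
    return W_new

# One sweep on a random output block, starting from a center-tap-style init.
rng = np.random.default_rng(0)
W0 = np.zeros((4, 2, 2))
W0[0] = np.eye(2)
u = rng.laplace(size=(2, 500))
W1 = nga_sweep(W0, u)
```

At a stationary point the averaged correlation cancels against **W**_{ℓ}, which is exactly the independence condition expressed by (10).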

When approaching convergence, Δ**W**_{ℓ}(*t*) → 0, assuming a sufficiently small step-size. The stationary points of (6) guarantee both temporal and spatial statistical independence under the following two conditions

$$E\left[\varphi_{i}(u_{i}(t))\,u_{j}^{\ast}(t-\ell)\right]=\begin{cases}\delta_{\ell}, & \forall\, t,\ell\ \text{and}\ i=j\\ 0, & \forall\, t,\ell\ \text{and}\ i\neq j\end{cases}$$

(10)

where *E*[·] is the statistical expectation, (·)^{∗} represents the complex-conjugate operator and *δ*_{ℓ} is the Kronecker delta, which is equal to 1 for ℓ = 0 and equal to 0 otherwise. As can be seen from (10), the convergence behavior of the NGA depends solely upon the magnitude of the so-called *estimating function* at each iteration, which is equal to

$$\mathbf{R}_{\ell}(t)=E\left[\boldsymbol{\varphi}(\mathbf{u}(t))\,\mathbf{u}^{H}(t-\ell)\right]$$

(11)

for lags ℓ = 0, 1,…, *L* − 1. Since the above estimating function is not equally sensitive to variations from all the updated filter coefficients, a tap-selection criterion can be constructed by employing only the *M* out of *L* coefficients with the largest values of |**R**_{ℓ}(*t*)| over all lags ℓ = 0, 1,…, *L* − 1 at each iteration. The subset of the filter coefficients to be partially updated at any particular time *t* is specified by the (*nL* × *mL*) matrix **Q**(*t*), coined the tap-selection matrix, which is given by

$$\mathbf{Q}(t)\triangleq\begin{pmatrix}\operatorname{diag}[\mathbf{q}_{11}(t)] & \cdots & \operatorname{diag}[\mathbf{q}_{1m}(t)]\\ \vdots & \ddots & \vdots\\ \operatorname{diag}[\mathbf{q}_{n1}(t)] & \cdots & \operatorname{diag}[\mathbf{q}_{nm}(t)]\end{pmatrix}$$

(12)

where each element of the tap-selection matrix is given by

$$\mathbf{q}_{ij}(t)\triangleq\left[q_{ij}(t),q_{ij}(t-1),\dots,q_{ij}(t-L+1)\right]^{T}$$

(13)

such that, after dropping the time index *t* for convenience,

$$q_{ij}(\tau)=\begin{cases}1, & \text{if}\ |r_{\ell}(\tau)|\in\left[M\ \text{maxima of}\ |\mathbf{R}_{\ell}(t)|\right]\\ 0, & \text{otherwise}\end{cases}$$

(14)

where |·| denotes absolute value and *r*_{ℓ}(*τ*) are the elements of (11) at the (sorted) indices *τ* = 0, 1,…, *L* − 1. Every element *q _{ij}*(*τ*) of the tap-selection matrix is thus equal to either one or zero, and hence the total number of taps selected for updating can be computed as

$$M=\operatorname{trace}\left[\mathbf{Q}(t)\right]$$

(15)

where the operator trace[·] denotes the sum of the diagonal elements of its matrix argument. Accordingly, to update only *M* taps of the equalizer **W**_{ℓ}(*t*), we can write the *selective natural gradient algorithm* (S-NGA) as follows

$$\tilde{\mathbf{W}}_{\ell}(t+1)=\tilde{\mathbf{W}}_{\ell}(t)+\Delta\tilde{\mathbf{W}}_{\ell}(t)$$

(16)

$$\Delta\tilde{\mathbf{W}}_{\ell}(t)=\lambda\left[\tilde{\mathbf{W}}_{\ell}(t)-\boldsymbol{\varphi}(\tilde{\mathbf{u}}(t))\,\tilde{\mathbf{y}}^{H}(t-\ell)\right]$$

(17)

$$\tilde{\mathbf{y}}(t)=\sum_{k=0}^{M-1}\tilde{\mathbf{W}}_{M-1-k}^{H}(t)\,\tilde{\mathbf{u}}(t-k)$$

(18)

$$\tilde{\mathbf{u}}(t)=\sum_{k=0}^{M-1}\mathbf{Q}(t)\,\mathbf{u}(t-k)$$

(19)

where the parameter *λ* represents the new learning rate. Note that for *M* = *L*, the S-NGA in (16)–(19) reduces to the *full-update* NGA given in (5)–(7). In general, the separation performance of the S-NGA depends on the degree of coefficient reduction achieved and therefore on the overall number of the sorted filter taps in **W̃**_{ℓ}(*t*).
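The gating effect of the tap-selection matrix can be sketched compactly. In this toy illustration (not the authors' code), the full per-lag update of (17) is assumed to be already computed; the 0/1 mask of (14) then freezes the unselected lags, which is the essence of (16):

```python
import numpy as np

def select_mask(R_mag, M):
    """Eq. (14): 1 for the M lags with the largest estimating-function
    magnitude |R_l(t)|, 0 elsewhere."""
    q = np.zeros(len(R_mag))
    q[np.argpartition(R_mag, -M)[-M:]] = 1.0
    return q

def snga_step(W, dW, q):
    """Eq. (16): apply the update only at the selected lags; the
    remaining taps are left untouched this iteration."""
    return W + q[:, None, None] * dW

# Toy dimensions: L = 8 lags, 2x2 filters, M = 3 selected taps.
rng = np.random.default_rng(3)
L, n, m, M = 8, 2, 2, 3
W = rng.standard_normal((L, n, m))
dW = rng.standard_normal((L, n, m))      # assumed precomputed via Eq. (17)
R_mag = np.abs(rng.standard_normal(L))   # per-lag |R_l(t)|

q = select_mask(R_mag, M)
W_new = snga_step(W, dW, q)              # only 3 of the 8 lags change
```

Setting *M* = *L* makes the mask all ones, recovering the full-update NGA, consistent with the remark above.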

**III. Experimental Methodology**

To assess the performance of the S-NGA, sentences spoken by five male speakers are corrupted by interfering speech. The signals are chosen from the IEEE speech corpus, which consists of phonetically balanced sentences, each composed of approximately 7 to 12 words (e.g., see [15]). Every sentence produced by a male talker is designated as the *target* speech. To simulate the speech interferer or competing voice in this experiment, a female talker uttering the sentence "Tea served from the brown jag is tasty" is chosen as the *interferer* (or *masker*) source. The source signals are approximately 4 s in duration and are recorded at a sampling rate of 8 kHz. A set of five convolutive speech mixtures is produced by convolving the clean signals with the two *binaural* room impulse responses (BRIRs) depicted in Fig. 1 (a) and (b) (e.g., see [16]). The length of the acoustic impulse responses is 2,048 sample points, corresponding to a delay of around 256 ms at a sampling rate of 8 kHz.

Fig. 1. Impulse responses of the cross-channel acoustic paths measured in a small classroom. (a) Impulse response corresponding to the acoustic path from the target source to microphone 2. (b) Impulse response corresponding to the acoustic path from the female masker to microphone 1.

In order to simplify the estimation problem, the channel distortions from the target source to microphone 1 and from the female masker to microphone 2 are assumed negligible. The BRIRs are measured in a 5 × 9 × 3.5 m ordinary classroom using a KEMAR manikin positioned 1.5 m above the floor, at ear level [16]. The broadband reverberation time^{3} of this enclosure is *T _{R}* = 200 ms, a typical value for a moderately reverberant environment. Both speech signals have the same onset and are normalized so that their maximum amplitude is unity. By convolving the speech signals with these pre-measured impulse responses, the target male is positioned directly in front of the listener at 0° azimuth, whereas the female interferer is placed at an angle of 60° to the right.

The S-NGA is executed with *L* = 2,048, whereas *M* is set to 2,048, 1,024, 512 and 256 taps. The learning rates are explicitly tuned to yield the maximum possible steady-state performance. In all cases, the equalizer is initialized using a center-tap scheme, such that **W**_{ℓ}(0) = *δ*_{ℓ − *M*/2} **I**. The algorithm operates with the hyperbolic tangent score function *ϕ _{i}*(*u _{i}*) = tanh(*u _{i}*), which is well matched to the super-Gaussian statistics of speech.

**IV. Results and Discussion**

We test the S-NGA on a set of convolutive speech mixtures generated using the pre-measured impulse responses in Fig. 1. Table I contrasts the separation performance, quantified by the signal-to-interference ratio improvement (SIRI), with the reduction in complexity relative to the full-update algorithm. As expected, the full-update NGA yields the best SIRI performance. Yet even when the S-NGA operates with a reduced number of filter coefficients at every iteration, it exhibits only slightly lower separation performance. The SIRI values indicate that the overall degree of separation remains unchanged for *M* = 1,024. In fact, even when setting *M* = 512, which amounts to a 75% reduction in the total equalizer length (with a processing delay of just 64 ms at 8 kHz) and around a 40% reduction in computational complexity, the algorithm retains almost 80% of its full-update counterpart's performance. The overall computational complexity of the S-NGA in terms of floating-point operations per second (FLOPS) is around 50% lower when only 3/16 of the filter coefficients (an 81.25% reduction) are adapted at every iteration. In all cases, to keep the cost of coefficient selection low, we use a fast sorting routine (e.g., see SORTLINE [14]), which requires only an additional log_{2}*L* + 2 tap comparison operations per sample.
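The selection step itself need not dominate the cost. As a rough stand-in for the running-sort routine of [14] (this is an assumption for illustration, not the paper's implementation), a linear-time partial selection finds the *M* largest magnitudes without fully sorting all *L* lags:

```python
import numpy as np

rng = np.random.default_rng(2)
L, M = 2048, 512
R_mag = np.abs(rng.standard_normal(L))   # per-lag estimating-function magnitudes

# Partial selection: O(L) on average, versus O(L log L) for a full sort.
sel = np.argpartition(R_mag, L - M)[L - M:]

# Every selected lag is at least as large as the (L - M)-th order statistic.
threshold = np.sort(R_mag)[L - M]
```

In a sample-by-sample implementation, a running sort such as SORTLINE [14] is cheaper still, since consecutive regressors differ in only one element.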

**V. Conclusions**

We have developed a selective-updating BSS scheme (S-NGA), which can learn multiple filters at a substantially reduced computational overhead, while retaining satisfactory separation performance. Experiments in reverberant settings show that the overall tradeoff between filter-length reduction and performance loss is acceptable. The S-NGA has great potential for use in portable devices, e.g., hearing aids and cochlear implants, since it can operate with considerably shorter filters and still equalize long acoustic echo paths with sufficient accuracy.

This work was supported by Grants R03 DC 008882 and R01 DC 007527 awarded from the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH).

^{1}For example in a two-source and three-sensor configuration convolved with 1,024 sample point acoustic impulse responses, the lower bound for the FIR filter length *L* yielding a perfect system inverse can decrease to around 512 taps, if we assume that the number of microphones doubles.

^{2}Instead of using the definition in (8), we could re-write the update in (6) as Δ**W**_{}(*t*) = *μ* [**W**_{}(*t*) − **(**** u**(

^{3}The parameter *T _{R}* defines the interval in which the reverberating sound energy, due to decaying reflections, reaches one millionth of its initial value. In other words, it is the time it takes for the reverberation level to drop by 60 dB below the original sound energy present in the room at a given instant.

**VI. References**

1. Douglas SC, Sawada H, Makino S. Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters. IEEE Trans Speech Audio Process. 2005;13:92–104.

2. Kokkinakis K, Nandi AK. Multichannel blind deconvolution for source separation in convolutive mixtures of speech. IEEE Trans Audio, Speech, Lang Process. 2006;14:200–212.

3. Parra L, Spence C. Convolutive blind separation of nonstationary sources. IEEE Trans Speech and Audio Process. 2000;8:320–327.

4. Smaragdis P. Blind separation of convolved mixtures in the frequency domain. Neurocomp. 1998;22:21–34.

5. Kokkinakis K, Loizou PC. Subband-based blind signal processing for source separation in convolutive mixtures of speech. Proc IEEE Int Conf on Acoust, Speech and Signal Process. 2007:917–920.

6. Ikram MZ, Morgan DR. Permutation inconsistency in blind speech separation: Investigation and solutions. IEEE Trans Speech and Audio Process. 2005;13:1–13.

7. Hofbauer M. On the FIR inversion of an acoustical convolutive mixing system: Properties and limitations. Proc Fifth Int Conf on ICA and BSS. 2004:643–651.

8. Kokkinakis K, Loizou PC. Using blind source separation techniques to improve speech recognition in bilateral cochlear implant patients. J Acoust Soc Am. 2008;123:2379–2390. [PubMed]

9. Douglas SC. Adaptive filters employing partial updates. IEEE Trans Circuits Syst II. 1997;44:209–216.

10. Aboulnasr T, Mayyas K. Complexity reduction of the NLMS algorithm via selective coefficient update. IEEE Trans Signal Process. 1999;47:1421–1424.

11. Khong AWH, Naylor PA. Selective-tap adaptive filtering with performance analysis for identification of time-varying systems. IEEE Trans Audio, Speech, Lang Process. 2007;15:1681–1695.

12. Amari SI, Douglas SC, Cichocki A, Yang HH. Multichannel blind deconvolution and equalization using the natural gradient. Proc IEEE Workshop on Signal Process Adv in Wireless Comms. 1997:101–104.

13. Bell AJ, Sejnowski TJ. An information maximization approach to blind separation and blind deconvolution. Neural Computat. 1995;7:1129–1159. [PubMed]

14. Pitas I. Fast algorithms for running ordering and max/min calculation. IEEE Trans Circuits Syst. 1989;36:795–804.

15. Subcommittee IEEE. IEEE recommended practice speech quality measurements. IEEE Trans Audio Electroacoust. 1969;17:225–246.

16. Shinn-Cunningham BG, Kopco N, Martin TJ. Localizing nearby sound sources in a classroom: Binaural room impulse responses. J Acoust Soc Am. 2005;117:3100–3115. [PubMed]
