Over the past few decades, many aspects of brain–computer interfaces (BCIs) have been studied. Of particular interest are event-related potential (ERP)-based BCI spellers, which aim to support mental typewriting. Audiovisual stimuli-based BCI systems have recently attracted much attention from researchers, and most existing studies of audiovisual BCIs are based on a semantic incongruent stimuli paradigm. However, no study has reported whether system performance or participant comfort differs between BCIs based on a semantic congruent paradigm and those based on a semantic incongruent paradigm.
The goal of this study was to investigate the effects of semantic congruency on system performance and participant comfort in an audiovisual BCI. Two audiovisual paradigms (semantic congruent and incongruent) were adopted, and 11 healthy subjects participated in the experiment. High-density electrical mapping of ERPs and behavioral data were measured for the two stimuli paradigms.
The behavioral data indicated no significant difference between the congruent and incongruent paradigms in offline classification accuracy. Nevertheless, eight of the 11 participants preferred the semantic congruent paradigm, two reported no difference between the two conditions, and only one preferred the semantic incongruent paradigm. In addition, the results indicated that the incongruent stimuli-based paradigm evoked a higher ERP amplitude.
In short, the semantic congruent paradigm offered better participant comfort while maintaining the same recognition rate as the incongruent paradigm. Furthermore, our study suggests that the paradigm design of spellers must take both system performance and user experience into consideration rather than merely pursuing a larger ERP response.
Brain–computer interface (BCI) technology provides a novel communication method for people with severe neuro-muscular diseases, such as severe paralysis and amyotrophic lateral sclerosis (ALS) [1, 2]. Specifically, a BCI system translates signals generated by brain activity into control signals for specific devices, which can help patients communicate with the external environment without using their muscles. Compared with other common BCI systems, for instance steady-state visual evoked potential (SSVEP)-based BCIs and sensorimotor rhythm (SMR)-based BCIs, event-related potential (ERP)-based BCIs are generally considered to be more stable and more efficient, characteristics that have led to ERPs becoming a primary control signal for BCI systems.
The speller is a typical BCI application that aims to help patients communicate with others by mental typewriting. Pioneers in this field have developed many classical ERP-based speller paradigms. Most paradigms in early research used visual-only unimodal stimuli. For instance, a P300-based paradigm was proposed by Farwell and Donchin in 1988; it is considered the first realization of a BCI speller and is still used today. In this paradigm, a 6 × 6 character matrix is presented on the screen, and a random sequence of 12 flashes (six rows and six columns) constitutes an oddball paradigm that evokes a P300 response. As ERP-based spellers progressed, unimodal paradigms based on other sensory stimuli, such as audio-only stimuli [7, 8] and tactile-only stimuli [9, 10], were proposed as well. More recently, hybrid BCI systems have come to the attention of researchers [11, 12]; in particular, multimodal paradigms combining auditory and visual stimuli have attracted much attention [13–17]. On the one hand, some studies pointed out that multimodal stimuli-based spellers can address the limitation of eye gaze ability [13, 18–20]. On the other hand, some researchers found that audiovisual spellers performed better than unimodal stimuli-based spellers [14, 16]. More specifically, Belitski et al. implemented an online experiment with multimodal stimuli combining audio and visual data, and compared the performance of an audiovisual stimuli paradigm, an audio-only stimuli paradigm, and a traditional matrix paradigm. The results indicated that the audiovisual stimuli-based speller performed better than both the traditional matrix speller and the unimodal auditory stimuli-based speller. The same result was also found in Wang’s research.
Additionally, Wang indicated that the ERP evoked by multimodal stimuli was significantly stronger than that evoked by either unimodal stimulus, and that it differed significantly from the sum of the ERPs of the two unimodal stimuli. An et al. explored audiovisual stimuli for gaze-independent BCIs from the perspectives of both behavioral and ERP data. The results showed significant differences in both performance and brain activity between multimodal and unimodal stimuli-based spellers.
Most existing studies of audiovisual spellers are based on semantic incongruent stimuli [13–16]. From a neurophysiological perspective, this is because incongruent stimuli demand more attention from participants, which may produce larger-amplitude ERPs than congruent stimuli. A similar viewpoint was verified in Andres’ research. However, some studies have focused on semantic congruent stimuli. For instance, Laurienti et al. showed that semantic congruent stimuli can significantly reduce reaction time, in an experiment that used two circles colored red or blue as visual stimuli in combination with the pronunciation of ‘red’ and ‘blue’ as audio stimuli. A speller based on semantic congruent stimuli involving visual and spoken numbers was also proposed, and its system performance was as good as that of traditional single-modality stimuli-based spellers. Overall, various multimodal paradigms combining auditory and visual stimuli have been developed. However, no study has yet compared semantic congruent and semantic incongruent stimuli from the perspective of the BCI speller.
In addition to system performance, the user comfort of spellers is becoming a higher priority for researchers. Earlier studies demonstrated the potential for physical and mental discomfort during long-term use of spellers. Moreover, semantic congruency may have psychological effects on the comfort, fatigue and mental workload of users. Thus, from the perspective of ergonomics, a well-designed speller should take both system performance and user comfort into account. This study focused on this issue and compared the congruent and incongruent paradigms in terms of both behavioral data and ERPs, to gain insight into the effect of semantic congruency on audiovisual stimuli-based spellers.
Eleven healthy participants (5 females) aged 22–33 years (mean ± SD, 23.9 ± 1.14 years) took part in the experiment. All participants had normal hearing and normal or corrected-to-normal vision. All participants volunteered for the experiment and provided written informed consent beforehand.
Two paradigms combining visual and audio stimuli were adopted in this experiment. The visual stimuli were the same in both audiovisual paradigms, whereas the audio stimuli differed. For the semantic congruent paradigm, each audio stimulus was congruent with the corresponding visual stimulus. For instance, if the current visual stimulus was the letter ‘a’, the corresponding audio stimulus was the syllable ‘ei’ (the pronunciation of the letter ‘a’), as shown in Fig. 1(A1). For the semantic incongruent paradigm, the audio stimuli had little relationship with the visual stimulus letters, as shown in Fig. 1(A2). In both paradigms, each audio stimulus and the corresponding visual stimulus were presented simultaneously.
The visual stimuli comprised four different letters, ‘a’, ‘b’, ‘c’, and ‘d’, and each letter had a unique color, as Fig. 1 shows. These letters were presented in a random sequence at the center of a 19-inch TFT screen with a refresh rate of 60 Hz. Each visual stimulus lasted 130 ms (the same duration as an audio stimulus), and the inter-stimulus interval (ISI) between two adjacent stimuli was 200 ms.
The audio stimuli comprised two conditions. For the semantic congruent condition, we selected four short spoken syllables (‘ei’, ‘bi’, ‘si’, and ‘di’) to match the visual stimuli, with ‘ei’ and ‘si’ on the left channel and ‘bi’ and ‘di’ on the right channel. For the semantic incongruent condition, the syllables ‘ti’, ‘to’, ‘it’, and ‘ot’ used in previous studies [24, 25] were adopted, with ‘ti’ and ‘it’ on the left channel and ‘to’ and ‘ot’ on the right channel. The duration of each stimulus was also 130 ms, and the ISI was 200 ms in both conditions. The audio stimuli were presented through comfortably positioned in-ear headphones, as Fig. 1 shows.
The experimental paradigm was implemented in E-Prime 2.0. Participants were instructed to sit in a comfortable position and keep their eyes fixed on the center of the screen, minimizing eye movements and other muscle artifacts throughout the experiment. For each participant, the experiment consisted of two paradigms (semantic congruent and semantic incongruent). Each paradigm was repeated twice, giving four blocks. In each block, a random sequence of four single-trial stimuli was repeated ten times. The experiment lasted about 8 min, after which the subjects were asked to relax for 5 min before repeating the experiment.
At the end of the experiment, all participants were asked which stimuli paradigm they preferred in terms of comfort. Specifically, the evaluation of comfort in this study was based on three questions: which paradigm is more difficult; which paradigm makes you feel more fatigued; and which paradigm would you choose if there were another experiment. When a paradigm was favored in two or three of the three questions, the participant was assumed to prefer that paradigm. There is currently no single accepted method for evaluating participant comfort. Garcia et al. compared the user comfort of different P300 speller configurations with the NASA Task Load Index (NASA-TLX), while Ekandem et al. evaluated user comfort by the duration of BCI use. Each of these studies considered only one factor of comfort, so neither approach is ideal on its own. Our study jointly considered the difficulty of the experiment, the fatigue of the participants and their willingness to take part in the experiment again, which we consider a more comprehensive basis for comfort evaluation.
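The two-out-of-three rule described above can be sketched as follows. This is only an illustration: the function name, the answer labels and the convention that each answer has already been recoded to name the paradigm the participant found more comfortable (with `None` for "no difference") are assumptions, not part of the original questionnaire.

```python
from collections import Counter

def preferred_paradigm(answers):
    """Return the paradigm favoured in at least two of the three questions.

    `answers` holds one entry per question, recoded (assumption) so that it
    names the paradigm the participant found more comfortable:
    "congruent", "incongruent", or None for "no difference".
    """
    counts = Counter(a for a in answers if a is not None)
    for paradigm, n in counts.items():
        if n >= 2:  # chosen in two or three of the three questions
            return paradigm
    return "no difference"

# Hypothetical participant: easier and less tiring under the congruent
# paradigm, but would pick the incongruent one for another experiment.
print(preferred_paradigm(["congruent", "congruent", "incongruent"]))
```

With three questions at most one paradigm can reach two votes, so the order in which the counts are inspected does not affect the result.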
The EEG signal was recorded from 64 Ag/AgCl scalp electrodes placed according to the extended International 10–20 system, and amplified by a Neuroscan NuAmp amplifier at a sampling rate of 500 Hz. During data acquisition, all channels were referenced to the nose tip, with the ground electrode placed in the frontal area next to the AFz channel. Electrode impedances were kept below 10 kΩ during data acquisition.
In data preprocessing, the raw EEG data were first re-referenced to the average signal of the left and right mastoids. Next, ocular and motion artifacts were removed by independent component analysis (ICA). The data were then filtered with a 0.5–40 Hz band-pass filter and down-sampled to 200 Hz for further analysis. Finally, epochs were extracted from −200 to 800 ms relative to each stimulus onset, and the baseline was removed by subtracting the mean value between −200 and 0 ms.
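The final step, epoch extraction with baseline removal, can be sketched for a single channel as below. This is a minimal illustration, not the authors' code: the function name, the plain-list signal representation and the argument names are assumptions; the 200 Hz rate and the −200 to 800 ms window follow the description above.

```python
def extract_epoch(signal, onset_index, fs=200, t_min=-0.2, t_max=0.8):
    """Cut one epoch around a stimulus onset and subtract the baseline mean.

    signal      : list of samples from one channel (already filtered and
                  down-sampled to `fs` Hz)
    onset_index : sample index of the stimulus onset
    """
    start = onset_index + round(t_min * fs)   # 40 samples before onset
    stop = onset_index + round(t_max * fs)    # 160 samples after onset
    if start < 0 or stop > len(signal):
        raise ValueError("epoch window falls outside the recording")
    epoch = signal[start:stop]
    n_base = onset_index - start              # samples in the -200..0 ms span
    baseline = sum(epoch[:n_base]) / n_base
    return [x - baseline for x in epoch]

# Toy signal: constant 5.0 before the onset, constant 7.0 afterwards.
toy = [5.0] * 100 + [7.0] * 300
epoch = extract_epoch(toy, onset_index=100)
print(len(epoch), epoch[0], epoch[40])  # 200 samples; 0.0 before, 2.0 after onset
```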
For better classification performance, the channels and the features used for offline classification were selected based on classifiability analysis. Classifiability is a parameter indicating the difference between target and non-target stimuli, and it is usually expressed by the r²-value, defined as

$$ r^2 = \left( \frac{\sqrt{M_T M_N}}{M_T + M_N} \cdot \frac{\operatorname{mean}(X_T) - \operatorname{mean}(X_N)}{\operatorname{std}(X_T \cup X_N)} \right)^2 \tag{1} $$

where \(M_T\) and \(M_N\) are the sample sizes of the target and non-target stimuli, respectively, and \(X_T\) and \(X_N\) are the selected feature vectors of the target and non-target stimuli, respectively.
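As an illustration, the r²-value for a single feature can be computed as sketched below. The sketch assumes the commonly used squared point-biserial form of r² and a population standard deviation over the pooled samples; the function name and the choice of standard deviation are assumptions rather than details given in the text.

```python
from math import sqrt
from statistics import mean, pstdev

def r_squared(target, non_target):
    """Squared point-biserial correlation between one feature and the class.

    target, non_target : feature values for target and non-target stimuli
    """
    m_t, m_n = len(target), len(non_target)
    # Population standard deviation of all samples pooled (assumption).
    pooled_sd = pstdev(list(target) + list(non_target))
    if pooled_sd == 0:
        return 0.0  # identical values everywhere: nothing to separate
    r = (sqrt(m_t * m_n) / (m_t + m_n)
         * (mean(target) - mean(non_target)) / pooled_sd)
    return r * r

# Identical class distributions give no classifiability.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Values lie between 0 and 1, with larger values meaning that the feature separates target from non-target responses more clearly.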
Usually, the classification performance of a speller depends not only on the amplitude of the ERP data, but also on the classifiability of the selected ERP features between target and non-target stimuli. Thus, the analysis of the r²-value provides a mathematical basis for selecting channels and the features of each channel. Specifically, the three time intervals (180–280, 300–450, and 480–530 ms) and the 10 channels (Cz, Pz, CPz, Oz, PO7, PO8, FC1, FC2, FC5, and FC6) with the greatest average r²-values over the whole dataset were chosen, and the ERP data in these intervals were down-sampled to 40 Hz and used as features. Thus, 10 × 12 = 120 features per stimulus were used for classification.
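The feature count can be checked with a short calculation. The per-window breakdown below is an inference, assuming that every 25 ms of a window contributes one sample at 40 Hz; only the totals of 12 per channel and 120 overall are stated in the text.

```python
def samples_in_window(start_ms, end_ms, fs_hz=40):
    """Number of samples a window contributes at the given sampling rate."""
    return (end_ms - start_ms) * fs_hz // 1000

windows_ms = [(180, 280), (300, 450), (480, 530)]
channels = ["Cz", "Pz", "CPz", "Oz", "PO7", "PO8", "FC1", "FC2", "FC5", "FC6"]

per_window = [samples_in_window(a, b) for a, b in windows_ms]
per_channel = sum(per_window)              # 4 + 6 + 2 = 12
n_features = len(channels) * per_channel   # 10 * 12 = 120

print(per_window, per_channel, n_features)
```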
Two classic classification algorithms, support vector machine (SVM) and stepwise linear discriminant analysis (SWLDA), were implemented for the binary classification. These two algorithms were shown to be the most effective classifiers in a previous speller study. The SVM approach has particular advantages in small-sample, nonlinear, and high-dimensional pattern recognition problems. An SVM separates two classes with the hyperplane that has the maximum distance from both classes. A linear kernel was chosen for the SVM model because a previous study indicated that the linear kernel performed better than nonlinear kernels such as the Gaussian kernel, and the penalty coefficient was optimized by tenfold cross-validation when training the SVM model. An implementation of the algorithm is available in LibSVM. SWLDA is an extension of LDA that selects the features used in the calculation, which gives better classification performance than plain LDA. To predict the target label, the input features were weighted by ordinary least-squares regression. In this way, 60 features were finally selected through a combination of backward and forward stepwise calculations [29, 31]. Three quarters of the total data were used for training the classifier, and the remaining quarter was used for testing.
To investigate differences in performance between the congruent and incongruent paradigms in detail, single-trial accuracy, character accuracy, and the recognition rate with increasing repetitions were compared using the bootstrap t test.
Bootstrapping-based t tests and ANOVAs were performed to analyze the effect of semantic congruency on system performance and brain activity. As a statistical method published by Efron, the bootstrapping approach does not depend on the normality of the sample distribution, which is a significant advantage over traditional parametric statistical methods. Specifically, bootstrapping re-estimates the distribution of the parent population by repeatedly resampling the original samples, usually with 1000 iterations. In this way, a confidence interval at a given significance level can be obtained, which helps to evaluate whether the difference between two conditions is significant. In addition, false discovery rate (FDR) correction was performed for multiple comparisons. Further background on the features and advantages of bootstrapping analysis is provided in Hesterberg et al.
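The two ideas in this paragraph can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the analysis code used in the study: `bootstrap_ci` builds a percentile confidence interval for the mean paired difference between two conditions, and `fdr_bh` applies the Benjamini-Hochberg procedure, which is assumed here because the text does not name the FDR method. Function names, the seed and the example values are hypothetical.

```python
import random

def bootstrap_ci(differences, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference.

    The difference between conditions is treated as significant at level
    `alpha` when the returned interval does not contain zero.
    """
    rng = random.Random(seed)
    n = len(differences)
    # Resample the observed differences with replacement, n_iter times.
    means = sorted(sum(rng.choices(differences, k=n)) / n
                   for _ in range(n_iter))
    lo = means[round(alpha / 2 * n_iter)]
    hi = means[round((1 - alpha / 2) * n_iter) - 1]
    return lo, hi

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg FDR control; returns one reject flag per p value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank          # largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Hypothetical per-subject differences between two conditions (11 subjects).
diffs = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.1, 0.95, 1.05, 1.2]
low, high = bootstrap_ci(diffs)
print(low > 0)                              # interval excludes zero
print(fdr_bh([0.001, 0.04, 0.03, 0.5]))     # only the smallest p value survives
```

The percentile interval is the simplest bootstrap interval; other variants (for example bias-corrected intervals) exist and may give slightly different bounds.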
Figure 2 shows the grand-averaged character recognition rate. Figure 2a shows the recognition rate as a function of the number of repetitions, and Fig. 2b shows the p values of the t test between the recognition rates of the two paradigms. As depicted in Fig. 2a, the character recognition rate increased with the number of repetitions for both the SWLDA and SVM classification algorithms. For further analysis, the bootstrapping t test was performed to compare the recognition rates of the incongruent and congruent paradigms for each classification method. As shown in Fig. 2b, the differences do not reach the significance level of 0.05, indicating no significant difference between the congruent and incongruent paradigms in the character recognition rates obtained by SWLDA and SVM.
The offline single-trial classification accuracy was determined by SWLDA and SVM, and the paradigm preferences of the 11 subjects were recorded. As Table 1 shows, for the incongruent stimuli paradigm, the average single-trial accuracies across the 11 subjects obtained by SWLDA and SVM are 70.1 and 69.3%, respectively. For the congruent stimuli paradigm, the corresponding average single-trial accuracies are 67.8 and 67.6%, respectively. According to Table 1, there is little difference between the two paradigms in the single-trial classification accuracies obtained by either SWLDA or SVM, which was verified by a 1000-iteration bootstrapping t test. The comparison of the single-trial accuracies of the two paradigms gave t(10) = 1.66, p = 0.12 for SWLDA and t(10) = 1.06, p = 0.316 for SVM. Overall, there is no significant difference between the paradigms in single-trial classification accuracy. Additionally, the recorded comfort information indicates that most subjects preferred the semantic congruent paradigm: 8 of 11 felt more comfortable under the congruent paradigm, two found no difference between the two paradigms, and only one subject preferred the incongruent condition.
Figure 3a shows the significant p values over time for 62 scalp electrodes from the bootstrapping comparison between the ERPs of the congruent and incongruent paradigms. ERPs evoked by both target and non-target stimuli were compared, and the results were FDR-corrected because hundreds of comparisons were performed simultaneously. To examine the differences in ERP components across brain regions, scalp maps at four specific time points, 200, 350, 440, and 620 ms (representing N2, P3, N4, and P5, respectively), are shown at the bottom of Fig. 3a. The main conclusions from the figure can be summarized as follows. In terms of spatio-temporal distribution, the paradigm type had a strong influence on the brain response to both target and non-target stimuli. For non-target stimuli, significant differences were mainly observed in the posterior brain area in the 190–210 ms interval, over the whole brain area in the 275–300 and 490–520 ms intervals, and in the anterior brain area in the 310–340 ms interval. For target stimuli, significant differences were mainly found over the whole brain area in the 200–220 ms interval, in the anterior brain area at 280–300 ms, and in the posterior brain area at 390–410 ms. In terms of the scalp maps, the main ERP components differed in different brain areas for target and non-target stimuli. Specifically, for non-target stimuli, there were significant differences over the whole brain area for N2 and P3, and in the posterior brain area for P5. For target stimuli, there were significant differences over almost the whole brain area for N2, in the anterior and central brain areas for P3, and in the posterior brain area for N4.
Figure 3b depicts the grand-averaged ERP waveforms at the selected electrodes CZ, POZ, and FC6 across the 11 participants for both target and non-target stimuli. As the main ERP components evoked by an oddball paradigm, N2 and P3 can be observed clearly in Fig. 3b. For further analysis, the amplitude and latency of the N2 and P3 components at the three selected electrodes were extracted, and 1000 iterations of the bootstrap t test were performed to compare them between the two paradigms. As shown in Fig. 4, there were significant differences in amplitude for N2 at electrode CZ (t(10) = −2.56, p = 0.03), and for P3 at electrodes CZ (t(10) = 2.82, p = 0.02), POZ (t(10) = 2.31, p = 0.04), and FC5 (t(10) = 3.24, p = 0.01), but no significant difference in latency was observed for either N2 or P3 at the three electrodes.
The classifiabilities between target and non-target stimuli over time for all 62 electrodes, calculated using formula (1), are depicted in Fig. 5a. Classifiabilities were analyzed for both paradigms based on the whole dataset. The average values for the three time intervals selected as features (180–280, 300–450, and 480–530 ms) are depicted as scalp maps in the bottom panel of Fig. 5a. Two major conclusions can be drawn from Fig. 5a: (1) in terms of the spatio-temporal distribution, higher classifiability values occur in the 200–500 ms interval for both paradigms; (2) in terms of the scalp maps, higher classifiability values were observed in the posterior brain area for 180–280 ms and in the central brain area for 300–450 ms, for both paradigms. In addition, for the 480–530 ms interval, higher values were observed in the right and left posterior brain areas for the incongruent paradigm, and only in the posterior brain area for the congruent paradigm.
Figure 5b depicts the bootstrapping t test results for the classifiabilities of the incongruent and congruent paradigms. FDR correction was performed because hundreds of comparisons were carried out simultaneously. As can be seen in Fig. 5b, only a few significant differences are observed, around the PZ electrode at approximately 370 ms. Beyond that, there was little difference in time or space, so classifiability can account for the comparable offline classification accuracy of the two paradigms.
The ERP-based speller is one of the most stable communication systems for patients with severe neuro-muscular diseases. However, only a few studies have investigated factors that may affect the system performance or user comfort of these systems. In this study, the effect of semantic congruency on an audiovisual BCI was investigated in terms of overall system performance and participant comfort. Furthermore, high-density electrical mapping of ERPs was analyzed to explain the obtained results.
First, the t test results for offline classification accuracy suggested that semantic congruency between auditory and visual stimuli had little effect on system performance. However, we found that significantly larger P300 waveforms were obtained with the incongruent paradigm than with the congruent paradigm, which contrasts with earlier reports that larger P300 waveforms lead to higher accuracies [34, 35]. Additionally, semantic congruency had a clear influence on participant comfort, since 8 of 11 participants reported that they felt more comfortable in the semantic congruent paradigm. The semantic incongruent paradigm is generally more complex, so participants must maintain greater focus and attention, and thus stronger brain activity, than in the congruent paradigm. Naturally, a sense of discomfort emerges gradually as the experiment goes on. The results of the ERP analysis were consistent with this inference. The bootstrapping results in Fig. 3a indicated significant differences around 280–350 ms over almost the whole brain area. Additionally, the main ERP components, P3 and N2, were extracted and compared in Fig. 4. Previous results indicated that the N2 component is associated with the interaction of auditory and visual processing. The P3 component was found to be closely related to workload and attention allocation for a given task [37, 38]. Therefore, a higher P3 amplitude implies stronger brain activity. These findings directly support our experimental results.
As mentioned above, increasing attention has been paid to user comfort in BCI research, making it an important indicator for BCI evaluations and, more generally, for human–computer interface (HCI) evaluations. Kübler et al. adapted the user-centered design (UCD) concept to BCI research and development, and assessed user satisfaction with questionnaires and visual-analogue scales. Ekandem et al. compared user comfort, experiment time, and preparation time between two different BCI devices from an ergonomic perspective. However, the quantification of user comfort remains a challenging problem because no general evaluation method has been published. Our results indicated that a larger P3 amplitude corresponded to poorer participant comfort, whereas a smaller P3 amplitude corresponded to better participant comfort. This reveals a potential relationship between ERP amplitude and participant comfort, which suggests that the evaluation of user comfort in BCIs might be approached from the perspective of physiological parameters.
In addition, our research found that a larger ERP amplitude did not lead to higher system performance. A large part of the reason may be that the brain response elicited by non-target stimuli increases together with the brain response to target stimuli, as shown in Fig. 3b. This is an interesting phenomenon and an important research issue, because it indicates that non-target stimuli can also elicit a steady brain response that differs between stimulus paradigms. Furthermore, these findings offer some suggestions for the future design of speller paradigms. Specifically, the design of speller paradigms should not focus solely on obtaining a large ERP amplitude. First, a larger ERP amplitude may not lead to higher system performance. Second, better system performance and better user comfort may not be achievable at the same time. In other words, the design of a speller paradigm must take both system performance and user comfort into consideration and strike a balance between the two.
Finally, it should be noted that only an offline experiment was implemented in this study, and our results lack a criterion or a detailed scale for participant comfort evaluation. To further explore the difference between the two paradigms, future studies should include an online experiment and a detailed participant comfort evaluation scale.
In conclusion, this study was designed to investigate the effects of semantic congruency between auditory and visual stimuli in an audiovisual speller. Behavioral data and ERP data were recorded for analysis and comparison. The results suggested that although congruency between auditory and visual stimuli in an audiovisual BCI speller had no significant effect on system performance, it had a strong influence on participant comfort and on the brain activity of participants. Furthermore, our study suggested that the paradigm design of spellers must take both system performance and user experience into consideration, rather than merely pursuing a larger ERP response.
XA and YC designed the experiment and wrote the manuscript. ZZ, HY and YC implemented the experiment. YC, XA and YK performed the data processing. XJ and MD revised the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
All authors consent to publish this paper.
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Our research was approved by the Ethics Committee of Tianjin University.
This work was supported in part by National Natural Science Foundation of China (NOs. 91520205, 81571762, 31500865, 51377120 and 31271062, 81671861), Tianjin Key Technology R&D Program (NO. 15ZCZDSY00930), Natural Science Foundation of Tianjin (NOs. 15JCYBYC29600 and 13JCQNJC13900), National Natural Science Foundation of China Youth Fund (NO. 61603269).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yong Cao and Xingwei An contributed equally to this work
Yong Cao, Email: moc.361@39_gnoyoac.
Xingwei An, Email: nc.ude.ujt@iewgnixna.
Yufeng Ke, Email: nc.ude.ujt@ekecneralc.
Jin Jiang, Email: moc.anis@81008120nijgnaij.
Hanjun Yang, Email: moc.621@00005h.
Yuqian Chen, Email: nc.ude.ujt@naiquynehc.
Xuejun Jiao, Email: moc.anis@emsijxj.
Hongzhi Qi, Email: nc.ude.ujt@zhq.
Dong Ming, Email: nc.ude.ujt@gnimdrahcir.