Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (4): 896-905    DOI: 10.3785/j.issn.1008-973X.2026.04.021
Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network
Zihao YAO, Hairong JIA*, Yarong LI, Guijun CHEN
College of Electronic and Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract  

A Multi-scale Attention Temporal Encoding Network (MATE-Net) was proposed to address the complexity of EEG signal features and the difficulty of acquiring covert speech data (whispered and imagined speech). The relatively abundant overt speech data were leveraged to train the model, which was then applied to covert speech decoding tasks. An Inception-based multi-receptive-field module was used to extract multi-scale features from the input signals, and a bidirectional GRU captured contextual dependencies to improve the representation of temporal dynamics. To ease the training of deep networks, residual connections were added to ensure stable gradient flow during backpropagation. A multi-head attention mechanism was further introduced to capture both local and global temporal dependencies, strengthening the representation of salient features in the sequence. Experimental results showed that the model performed well on overt speech decoding: in five-fold cross-validation, the average test-set accuracy reached 74.30%, with Spearman and Pearson correlation coefficients of 0.884 and 0.942, respectively. The pre-trained MATE-Net was successfully transferred to whispered and imagined speech tasks, enabling effective reconstruction of speech spectrograms.
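The abstract names four building blocks: an Inception-style multi-receptive-field front end, residual connections, a bidirectional GRU, and multi-head attention. The PyTorch sketch below is an illustration of how such a stack can be wired, not the authors' implementation: the paper's actual kernel widths, channel counts, hidden sizes, and output dimensionality are not stated on this page, so every hyperparameter here (kernel sizes 3/5/7, 64 EEG channels, 80 mel bins) and every class name is an assumption.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel temporal convolutions with different receptive fields
    (kernel sizes 3/5/7 are illustrative, not the paper's values)."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in (3, 5, 7)
        )

    def forward(self, x):                    # x: (batch, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)

class MATENetSketch(nn.Module):
    """Hypothetical MATE-Net-like stack: Inception front end + residual
    projection + bidirectional GRU + multi-head self-attention."""
    def __init__(self, n_channels=64, branch_ch=16, hidden=64,
                 n_heads=4, n_mel=80):
        super().__init__()
        feat = 3 * branch_ch                 # concatenated branch width
        self.inception = InceptionBlock(n_channels, branch_ch)
        self.proj = nn.Conv1d(n_channels, feat, 1)   # 1x1 conv for residual
        self.gru = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_mel)     # per-frame mel prediction

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.inception(x) + self.proj(x)         # residual connection
        h = h.transpose(1, 2)                        # -> (batch, time, feat)
        h, _ = self.gru(h)                           # contextual encoding
        a, _ = self.attn(h, h, h)                    # local/global dependencies
        h = h + a                                    # residual around attention
        return self.head(h)                          # (batch, time, n_mel)
```

For a 2-sample batch of 64-channel EEG with 100 time steps, `MATENetSketch()(x)` yields a `(2, 100, 80)` mel-spectrogram estimate per frame.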



Key words: brain-computer interface; electroencephalography (EEG); overt speech; whispered speech; imagined speech
Received: 21 April 2025      Published: 19 March 2026
CLC:  TN 911.7  
Fund: National Natural Science Foundation of China (62201377); Shanxi Basic Research Program (202403021211098); Shanxi Graduate Research Innovation Project (RC2400005582).
Corresponding Authors: Hairong JIA     E-mail: 1078349047@qq.com;helenjia722@163.com
Cite this article:

Zihao YAO,Hairong JIA,Yarong LI,Guijun CHEN. Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network. Journal of ZheJiang University (Engineering Science), 2026, 60(4): 896-905.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.04.021     OR     https://www.zjujournals.com/eng/Y2026/V60/I4/896


Fig.1 Overall process of speech-evoked EEG decoding
Fig.2 Multi-scale Attention Temporal Encoding Network (MATE-Net) architecture
Item          Configuration
OS            Windows 11
CPU           Intel Core i9-14900KF
Clock speed   3.20 GHz
RAM           128 GB
GPU           NVIDIA RTX 4090D
VRAM          24 GB
IDE           PyCharm
Language      Python 3.10
Tab.1 Hardware and software configuration of experiment
Fold     Acc_train/%   Acc_test/%   MSE     S_cos   ρ       r
Fold 1   76.38         74.33        0.319   0.975   0.890   0.945
Fold 2   76.32         74.00        0.344   0.975   0.880   0.937
Fold 3   76.31         74.14        0.334   0.975   0.885   0.941
Fold 4   76.35         74.43        0.314   0.975   0.884   0.943
Fold 5   76.31         74.60        0.314   0.975   0.882   0.943
Mean     76.33         74.30        0.325   0.975   0.884   0.942
Tab.2 Experimental results of five-fold cross-validation
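The ρ and r columns in Tab.2 are Spearman and Pearson correlations between predicted and reference spectrogram values. For readers reproducing such scores, a minimal stdlib-only sketch follows; the function names are ours, and in practice `scipy.stats.pearsonr`/`spearmanr` are the standard tools.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences
    (assumes neither sequence is constant)."""
    mx, my = mean(x), mean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r of the rank-transformed data,
    with average ranks assigned to ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1                      # extend run of tied values
            avg = (i + j) / 2 + 1           # average rank for the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson_r(ranks(x), ranks(y))
```

Pearson measures linear agreement between predicted and reference values, while Spearman only requires monotone agreement, which is why a model can score higher on r than on ρ (as in Tab.2) when its errors are close to linear rescalings.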
Model structure                        Acc_train/%   ρ       r
Model 1 (Inception only)               53.46         0.609   0.634
Model 2 (Inception + residual)         63.90         0.743   0.795
Model 3 (Inception + residual + GRU)   68.59         0.829   0.893
Model 4 (MATE-Net)                     74.30         0.884   0.942
Tab.3 Ablation results on the impact of key MATE-Net modules on model performance
Model                 Acc_train/%   ρ       r
LDA                   56.61         0.645   0.661
Logistic Regression   55.39         0.640   0.658
Decision Tree         61.25         0.726   0.765
ShallowConv[27]       55.15         0.493   0.459
DeepConv[27]          57.49         0.648   0.703
EEGNet[27]            58.31         0.541   0.570
EEG-ITNet[28]         54.66         0.532   0.542
EEG-Transformer[29]   59.46         0.683   0.713
EEG-TCNet[30]         61.80         0.712   0.738
EEGformer[31]         64.39         0.813   0.872
MATE-Net              74.30         0.884   0.942
Tab.4 Performance comparison between MATE-Net and existing decoding methods
Fig.3 Comparison of original and reconstructed spectrograms
Fig.4 Comparison of original Mel spectrogram and reconstructed Mel spectrogram at word level
Fig.5 Comparison of original and reconstructed speech waveforms
Model           Acc_train (P1/P2/P3)/%    ρ (P1/P2/P3)              r (P1/P2/P3)
LDA             32.27 / 33.60 / 33.49     0.478 / 0.457 / 0.487     0.472 / 0.439 / 0.448
Decision Tree   45.55 / 46.92 / 45.01     0.581 / 0.593 / 0.571     0.582 / 0.602 / 0.570
EEGformer       54.34 / 53.50 / 54.79     0.781 / 0.783 / 0.766     0.782 / 0.789 / 0.770
MATE-Net        68.72 / 67.80 / 66.99     0.937 / 0.905 / 0.922     0.938 / 0.907 / 0.924
Tab.5 Cross-dataset comparative experimental results
Model       Mean r (whispered/imagined)   Median r (whispered/imagined)
EEGformer   0.313 / 0.366                 0.352 / 0.391
MATE-Net    0.376 / 0.449                 0.437 / 0.467
Tab.6 Correlation between overt speech and covert speech
Fig.6 Spectral and waveform comparison between overt and covert speech
[1]   HILARI K, NEEDLE J J, HARRISON K L What are the important factors in health-related quality of life for people with aphasia? a systematic review[J]. Archives of Physical Medicine and Rehabilitation, 2012, 93 (1): S86- S95
doi: 10.1016/j.apmr.2011.05.028
[2]   JELLINGER K A The spectrum of cognitive dysfunction in amyotrophic lateral sclerosis: an update[J]. International Journal of Molecular Sciences, 2023, 24 (19): 14647
doi: 10.3390/ijms241914647
[3]   LIU Jinzhen, YE Fangfang, XIONG Hui Recognition of multi-class motor imagery EEG signals based on convolutional neural network[J]. Journal of Zhejiang University: Engineering Science, 2021, 55 (11): 2054- 2066
[4]   TANG J Effect of brain-computer interface training on functional recovery after stroke[J]. Theoretical and Natural Science, 2023, 21 (1): 75- 79
doi: 10.54254/2753-8818/21/20230821
[5]   LUO S, ANGRICK M, COOGAN C, et al Stable decoding from a speech BCI enables control for an individual with ALS without recalibration for 3 months[J]. Advanced Science, 2023, 10 (35): e2304853
doi: 10.1002/advs.202304853
[6]   ANUMANCHIPALLI G K, CHARTIER J, CHANG E F Speech synthesis from neural decoding of spoken sentences[J]. Nature, 2019, 568 (7753): 493- 498
doi: 10.1038/s41586-019-1119-1
[7]   LI M, LIAO S, PUN S H, et al. Effects of EEG analysis window location on classifying spoken mandarin monosyllables [C]// 11th International IEEE/EMBS Conference on Neural Engineering. Baltimore: IEEE, 2023: 1–4.
[8]   MELINDA M, JUWONO F H, ENRIKO I K A, et al Application of continuous wavelet transform and support vector machine for autism spectrum disorder electroencephalography signal classification[J]. Radioelectronic and Computer Systems, 2023, (3): 73- 90
doi: 10.32620/reks.2023.3.07
[9]   LIU H, ONG Y S, YU Z, et al Scalable Gaussian process classification with additive noise for non-Gaussian likelihoods[J]. IEEE Transactions on Cybernetics, 2022, 52 (7): 5842- 5854
doi: 10.1109/TCYB.2020.3043355
[10]   ALENAZI F S, EL HINDI K, ASSADHAN B Complement-class harmonized Naïve Bayes classifier[J]. Applied Sciences, 2023, 13 (8): 4852
doi: 10.3390/app13084852
[11]   ABDULGHANI M M, WALTERS W L, ABED K H Imagined speech classification using EEG and deep learning[J]. Bioengineering, 2023, 10 (6): 649
doi: 10.3390/bioengineering10060649
[12]   QI H, GAO N. Research on the classification algorithm of imaginary speech EEG signals based on twin neural network [C]// 7th International Conference on Signal and Image Processing. Suzhou: IEEE, 2022: 211–216.
[13]   GASPARINI F, CAZZANIGA E, SAIBENE A. Inner speech recognition through electroencephalographic signals [EB/OL]. (2022–10–11) [2025–04–21]. https://arxiv.org/abs/2210.06472.
[14]   PARK H J, LEE B Multiclass classification of imagined speech EEG using noise-assisted multivariate empirical mode decomposition and multireceptive field convolutional neural network[J]. Frontiers in Human Neuroscience, 2023, 17: 1186594
doi: 10.3389/fnhum.2023.1186594
[15]   VORONTSOVA D, MENSHIKOV I, ZUBOV A, et al Silent EEG-speech recognition using convolutional and recurrent neural network with 85% accuracy of 9 words classification[J]. Sensors, 2021, 21 (20): 6744
doi: 10.3390/s21206744
[16]   CHEN X, WANG R, KHALILIAN-GOURTANI A, et al A neural speech decoding framework leveraging deep learning and speech synthesis[J]. Nature Machine Intelligence, 2024, 6 (4): 467- 480
doi: 10.1038/s42256-024-00824-8
[17]   MARTIN S, BRUNNER P, ITURRATE I, et al Word pair classification during imagined speech using direct brain recordings[J]. Scientific Reports, 2016, 6: 25803
doi: 10.1038/srep25803
[18]   KOMEIJI S, MITSUHASHI T, IIMURA Y, et al Feasibility of decoding covert speech in ECoG with a Transformer trained on overt speech[J]. Scientific Reports, 2024, 14: 11491
doi: 10.1038/s41598-024-62230-9
[19]   CANOLTY R T, EDWARDS E, DALAL S S, et al High gamma power is phase-locked to theta oscillations in human neocortex[J]. Science, 2006, 313 (5793): 1626- 1628
doi: 10.1126/science.1128115
[20]   ABDI H, WILLIAMS L J Principal component analysis[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2 (4): 433- 459
doi: 10.1002/wics.101
[21]   SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions [C]// IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1–9.
[22]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[23]   DEY R, SALEM F M. Gate-variants of gated recurrent unit (GRU) neural networks [C]// IEEE 60th International Midwest Symposium on Circuits and Systems. Boston: IEEE, 2017: 1597–1600.
[24]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30.
[25]   SRIVASTAVA N, HINTON G E, KRIZHEVSKY A, et al Dropout: a simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15: 1929- 1958
[26]   ANGRICK M, OTTENHOFF M C, DIENER L, et al Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity[J]. Communications Biology, 2021, 4: 1055
doi: 10.1038/s42003-021-02578-0
[27]   LAWHERN V J, SOLON A J, WAYTOWICH N R, et al EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces[J]. Journal of Neural Engineering, 2018, 15 (5): 056013
doi: 10.1088/1741-2552/aace8c
[28]   SALAMI A, ANDREU-PEREZ J, GILLMEISTER H EEG-ITNet: an explainable inception temporal convolutional network for motor imagery classification[J]. IEEE Access, 2022, 10: 36672- 36685
doi: 10.1109/ACCESS.2022.3161489
[29]   LEE Y E, LEE S H. EEG-transformer: self-attention from transformer architecture for decoding EEG of imagined speech [C]// 10th International Winter Conference on Brain-Computer Interface. Gangwon-do: IEEE, 2022: 1–4.
[30]   INGOLFSSON T M, HERSCHE M, WANG X, et al. EEG-TCNet: an accurate temporal convolutional network for embedded motor-imagery brain-machine interfaces [C]// IEEE International Conference on Systems, Man, and Cybernetics. Toronto: IEEE, 2020: 2958–2965.
[31]   WAN Z, LI M, LIU S, et al EEGformer: a transformer-based brain activity classification method using EEG signal[J]. Frontiers in Neuroscience, 2023, 17: 1148855
doi: 10.3389/fnins.2023.1148855