Journal of ZheJiang University (Engineering Science)  2022, Vol. 56 Issue (9): 1796-1805    DOI: 10.3785/j.issn.1008-973X.2022.09.013
    
Medical image segmentation method combining multi-scale and multi-head attention
Wan-liang WANG1,2, Tie-jun WANG1, Jia-cheng CHEN1, Wen-bo YOU1
1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
2. College of Information Science and Technology, Zhejiang Shuren University, Hangzhou 310015, China

Abstract  

A neural network-based segmentation model, MS2Net, was proposed to automatically and accurately extract regions of interest from medical images. To better extract context information, a network architecture combining convolution and Transformer was proposed, which addressed the inability of traditional convolution operations to capture long-range dependencies. In the Transformer-based context extraction module, multi-head self-attention was used to obtain the similarity relationships between pixels; the features of each pixel were fused according to these similarities, giving the network a global view, while relative positional encoding enabled the Transformer to retain the structural information of the input feature map. To adapt the network to regions of interest of different sizes, MS2Net exploited the multi-scale features of the decoder, and a multi-scale attention mechanism was proposed: group channel attention and group spatial attention were applied to the multi-scale feature map in turn, so that the network adaptively selected the appropriate multi-scale semantic information. MS2Net achieved a better intersection-over-union than advanced methods such as U-Net, CE-Net, DeepLab v3+ and UTNet on both the ISBI 2017 and CVC-ColonDB datasets, reflecting its excellent generalization ability.
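The context extraction module rests on two mechanisms named in the abstract: multi-head self-attention over feature-map pixels and relative positional encoding. The PyTorch sketch below illustrates how the two combine; the class name, the fixed feature-map size, and the learned relative-position bias table are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Multi-head self-attention over the pixels of a square feature map,
    with a learned relative positional bias (hypothetical sketch)."""

    def __init__(self, dim, heads=4, feat_size=16):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        # One learnable bias per head for each of the (2s-1)^2 relative offsets.
        s = feat_size
        self.rel_bias = nn.Parameter(torch.zeros(heads, (2 * s - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(s), torch.arange(s), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :] + s - 1  # shift offsets to >= 0
        self.register_buffer("rel_index", rel[0] * (2 * s - 1) + rel[1])

    def forward(self, x):  # x: B x C x s x s, with s == feat_size
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def to_heads(t):  # B x C x H x W -> B x heads x N x d, N = H*W
            return t.reshape(B, self.heads, C // self.heads, H * W).transpose(-2, -1)

        q, k, v = to_heads(q), to_heads(k), to_heads(v)
        attn = q @ k.transpose(-2, -1) * self.scale        # pixel-pixel similarity
        attn = attn + self.rel_bias[:, self.rel_index]     # retain structural information
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)  # fuse per-pixel features
        return self.proj(out)
```

Per the ablations below, the module performs best with four Transformer blocks and 32 heads (Tab.4, Tab.5), and dropping the relative positional encoding costs about 0.6 percentage points of JA (Tab.6).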



Key words: medical image segmentation; deep learning; attention; Transformer; multi-scale
Received: 05 August 2021      Published: 28 September 2022
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61873240)
Cite this article:

Wan-liang WANG,Tie-jun WANG,Jia-cheng CHEN,Wen-bo YOU. Medical image segmentation method combining multi-scale and multi-head attention. Journal of ZheJiang University (Engineering Science), 2022, 56(9): 1796-1805.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2022.09.013     OR     https://www.zjujournals.com/eng/Y2022/V56/I9/1796


Fig.1 Overall framework of MS2Net and details of its local modules
Fig.2 Multi-scale attention mechanism
Unit: %
Module combination JA DI ACC SEN SPE TJI
BN 77.27 85.71 93.65 84.70 96.27 68.26
BN+MSA 78.08 86.19 93.97 85.95 96.17 69.85
BN+CE 77.91 86.00 93.77 84.63 96.67 70.01
MS2Net 78.43 86.28 93.81 85.96 96.72 71.60
Tab.1 Effects of different module combinations on segmentation performance
Unit: %
Pre-training JA DI ACC SEN
Without 77.97 85.98 93.45 85.73
With 78.43 86.28 93.81 85.96
Tab.2 Effect of pre-training on segmentation performance
Fusion mode JA/% DI/% ACC/% SEN/% NP/10^6 FLOPs/10^9
Concatenation 77.83 85.96 93.79 84.55 22.47 17.59
Addition 78.43 86.28 93.81 85.96 21.66 16.96
Tab.3 Influence of feature-map fusion mode in skip connections on segmentation performance
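To make the comparison in Tab.3 concrete, the snippet below sketches the two skip-connection fusion modes (function names are illustrative). Addition leaves the channel count unchanged, which is why it needs fewer parameters and FLOPs than stacking while, per Tab.3, also segmenting slightly better here.

```python
import torch

def fuse_concat(enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
    # "Stacking": channel concatenation doubles the channel count,
    # so the next convolution needs more parameters and FLOPs.
    return torch.cat([enc_feat, dec_feat], dim=1)

def fuse_add(enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
    # Element-wise addition keeps the channel count unchanged
    # (both maps must already have the same shape).
    return enc_feat + dec_feat
```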
Unit: %
T JA DI ACC SEN
1 77.97 86.03 93.80 85.42
2 78.08 86.19 93.67 85.56
4 78.43 86.28 93.81 85.96
8 78.19 86.17 93.80 86.18
Tab.4 Influence of the number of Transformer blocks (T) on segmentation indexes when the number of heads is 32
Unit: %
N JA DI ACC SEN
1 77.53 85.69 93.65 85.42
2 77.75 85.91 93.58 85.33
4 77.84 85.91 93.83 84.97
8 77.88 85.85 93.24 86.24
16 78.11 86.22 93.93 85.99
32 78.43 86.28 93.81 85.96
Tab.5 Influence of the number of heads (N) on segmentation indexes when the number of Transformer blocks is 4
Unit: %
Relative positional encoding JA DI ACC SEN
Without 77.86 85.94 93.76 84.50
With 78.43 86.28 93.81 85.96
Tab.6 Effect of relative positional encoding on segmentation performance
Unit: %
GCA GSA JA DI ACC SEN
✗ ✗ 77.75 85.87 93.65 84.53
✓ ✗ 78.13 86.15 93.54 85.72
✗ ✓ 77.88 85.94 93.49 85.89
✓ ✓ 78.43 86.28 93.81 85.96
Tab.7 Effect of group channel attention (GCA) and group spatial attention (GSA) on segmentation indexes
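The following is a hedged sketch of the multi-scale attention idea behind Tab.7: group channel attention (GCA) followed by group spatial attention (GSA), applied in turn to the concatenated multi-scale decoder features. The grouping scheme, the SE-style channel gate, and the 7×7 spatial gate are assumptions for illustration, not the paper's exact layer design.

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """GCA then GSA over grouped multi-scale features (hypothetical sketch)."""

    def __init__(self, channels: int, groups: int = 4, reduction: int = 8):
        super().__init__()
        assert channels % (groups * reduction) == 0
        self.groups = groups
        # GCA: an SE-style squeeze-and-excite computed independently per group.
        self.gca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, groups=groups),
            nn.Sigmoid(),
        )
        # GSA: one spatial attention map per group of channels.
        self.gsa = nn.Sequential(
            nn.Conv2d(channels, groups, kernel_size=7, padding=3, groups=groups),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.gca(x)                     # reweight channels group-wise
        b, c, h, w = x.shape
        gate = self.gsa(x)                      # B x groups x H x W
        x = x.reshape(b, self.groups, c // self.groups, h, w)
        x = x * gate.unsqueeze(2)               # reweight positions group-wise
        return x.reshape(b, c, h, w)
```

Tab.7 indicates that each gate helps on its own and that applying both in turn yields the best JA, while Tab.8 shows the same block outperforming SE, CBAM, and MS-Dual-Guided fusion.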
Fig.3 Attention weight visualization of lesions of different scales
Unit: %
Fusion method JA DI ACC SEN
Direct fusion 77.75 85.87 93.65 84.53
SE 77.59 85.79 93.48 85.25
CBAM 78.02 86.03 93.62 85.74
MS-Dual-Guided 77.83 85.80 93.60 85.86
MSA 78.43 86.28 93.81 85.96
Tab.8 Comparison of segmentation indexes of different fusion methods
Unit: %
Method JA DI ACC SEN SPE TJI
U-Net 72.81 81.78 92.23 80.36 97.33 60.83
Attention U-Net 72.93 81.89 92.10 81.72 96.97 61.27
Swin-Unet 66.04 75.61 90.46 79.11 93.81 52.58
RAUNet 77.26 85.49 93.68 83.48 97.50 69.47
SFUNet 76.15 84.57 93.38 82.98 96.83 67.01
DeepLab v3+ (Xception) 77.37 85.67 93.61 83.96 96.90 69.10
CE-Net 77.46 85.43 93.68 83.84 97.12 70.49
CA-Net 77.16 85.38 93.13 85.80 95.53 68.56
UTNet 77.47 85.51 93.55 87.15 95.48 70.10
MS-Dual-Guided 76.48 84.72 92.65 87.11 94.54 68.17
MS2Net 78.43 86.28 93.81 85.96 96.72 71.60
Tab.9 Comparison of segmentation performance of different algorithms on ISBI 2017 dataset
Unit: %
Method JA DI ACC SEN
U-Net 76.70±1.73 83.97±1.71 97.85±0.19 84.12±3.10
Attention U-Net 76.71±3.00 83.65±2.95 97.98±0.40 83.61±3.35
Swin-Unet 34.36±2.62 44.84±2.92 92.05±0.65 54.79±2.43
RAUNet 82.41±1.79 88.81±1.89 98.63±0.20 89.12±1.71
SFUNet 80.12±0.79 87.23±0.81 98.32±0.13 87.86±1.77
DeepLab v3+ (Xception) 79.34±1.24 85.61±1.41 98.52±0.10 86.19±1.38
CE-Net 81.71±1.65 87.95±1.77 98.50±0.08 88.79±2.08
CA-Net 76.01±1.60 83.73±1.63 97.57±0.29 85.35±1.93
UTNet 78.39±2.32 85.51±2.30 98.20±0.21 85.96±2.68
MS2Net 82.83±1.71 89.19±1.89 98.52±0.22 89.68±1.95
Tab.10 Comparison of segmentation performance of different algorithms on CVC-ColonDB dataset
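For reference, the metric abbreviations used throughout the tables are the standard segmentation measures: Jaccard index (JA), Dice coefficient (DI), accuracy (ACC), sensitivity (SEN), specificity (SPE), and thresholded Jaccard index (TJI). A minimal sketch of their per-image computation from binary masks follows; treating TJI as the Jaccard score zeroed below 0.65 follows the ISIC challenge convention and is an assumption here.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Per-image metrics from binary masks (values in {0, 1})."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)          # true positives
    tn = np.sum(~pred & ~gt)        # true negatives
    fp = np.sum(pred & ~gt)         # false positives
    fn = np.sum(~pred & gt)         # false negatives
    ja = tp / (tp + fp + fn)
    return {
        "JA": ja,                                  # Jaccard index (IoU)
        "DI": 2 * tp / (2 * tp + fp + fn),         # Dice coefficient
        "ACC": (tp + tn) / (tp + tn + fp + fn),    # pixel accuracy
        "SEN": tp / (tp + fn),                     # sensitivity (recall)
        "SPE": tn / (tn + fp),                     # specificity
        "TJI": ja if ja >= 0.65 else 0.0,          # thresholded Jaccard (assumed rule)
    }
```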
Fig.4 Comparison of medical image segmentation results of different methods
[1]   LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.
[2]   RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234-241.
[3]   HE K, ZHANG X, REN S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 37(9): 1904-1916.
[4]   ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2881-2890.
[5]   GU Z, CHENG J, FU H, et al. CE-Net: context encoder network for 2D medical image segmentation [J]. IEEE Transactions on Medical Imaging, 2019, 38(10): 2281-2292. doi: 10.1109/TMI.2019.2903562
[6]   WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7794-7803.
[7]   XING X, YUAN Y, MENG M Q. Zoom in lesions for better diagnosis: attention guided deformation network for WCE image classification [J]. IEEE Transactions on Medical Imaging, 2020, 39(12): 4047-4059. doi: 10.1109/TMI.2020.3010102
[8]   LIU R, LIU M, SHENG B, et al. NHBS-Net: a feature fusion attention network for ultrasound neonatal hip bone segmentation [J]. IEEE Transactions on Medical Imaging, 2021, 40(12): 3446-3458. doi: 10.1109/TMI.2021.3087857
[9]   CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848.
[10]   CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. [2021-08-01]. https://arxiv.org/abs/1706.05587.
[11]   SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-V4, inception-ResNet and the impact of residual connections on learning [C]// Thirty-first AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 4278-4284.
[12]   ZHU X, HU H, LIN S, et al. Deformable ConvNets v2: more deformable, better results [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 9308-9316.
[13]   WANG X, JIANG X, DING H, et al. Bi-directional dermoscopic feature learning and multi-scale consistent decision fusion for skin lesion segmentation [J]. IEEE Transactions on Image Processing, 2019, 29: 3039-3051.
[14]   HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7132-7141.
[15]   GAO Ying-qi, GUO Song, LI Ning, et al. Arteriovenous classification method in fundus images based on semantic fusion [J]. Journal of Image and Graphics, 2020, 25(10): 2259-2270. (in Chinese) doi: 10.11834/jig.200187
[16]   NI Z L, BIAN G B, ZHOU X H, et al. RAUNet: residual attention U-Net for semantic segmentation of cataract surgical instruments [C]// International Conference on Neural Information Processing. Shenzhen: Springer, 2019: 139-149.
[17]   OKTAY O, SCHLEMPER J, FOLGOC L L, et al. Attention U-Net: learning where to look for the pancreas [EB/OL]. [2021-08-01]. https://arxiv.org/abs/1804.03999.
[18]   PARK J, WOO S, LEE J Y, et al. BAM: bottleneck attention module [EB/OL]. [2021-08-01]. https://arxiv.org/abs/1807.06514.
[19]   WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 3-19.
[20]   GU R, WANG G, SONG T, et al. CA-Net: comprehensive attention convolutional neural networks for explainable medical image segmentation [J]. IEEE Transactions on Medical Imaging, 2020, 40(2): 699-711.
[21]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. [2021-08-01]. https://arxiv.org/abs/2010.11929.
[22]   LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012-10022.
[23]   GUO J, HAN K, WU H, et al. CMT: convolutional neural networks meet vision transformers [EB/OL]. [2021-08-01]. https://arxiv.org/abs/2107.06263.
[24]   CAO H, WANG Y, CHEN J, et al. Swin-Unet: Unet-like pure Transformer for medical image segmentation [EB/OL]. [2021-08-01]. https://arxiv.org/abs/2105.05537.
[25]   GAO Y, ZHOU M, METAXAS D. UTNet: a hybrid Transformer architecture for medical image segmentation [EB/OL]. [2021-08-01]. https://arxiv.org/abs/2107.00781.
[26]   CODELLA N C F, GUTMAN D, CELEBI M E, et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC) [C]// 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). Washington, DC: IEEE, 2018: 168-172.
[27]   BERNAL J, SANCHEZ J, VILARINO F. Towards automatic polyp detection with a polyp appearance model [J]. Pattern Recognition, 2012, 45(9): 3166-3182. doi: 10.1016/j.patcog.2012.03.002
[28]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[29]   WU H, PAN J, LI Z, et al. Automated skin lesion segmentation via an adaptive dual attention module [J]. IEEE Transactions on Medical Imaging, 2020, 40(1): 357-370.