Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (2): 247-256    DOI: 10.3785/j.issn.1008-973X.2024.02.003
Dynamic sampling dual deformable network for online video instance segmentation
Yiran SONG1(),Qianyu ZHOU1,Zhiwen SHAO1,2,Ran YI1,Lizhuang MA1,*()
1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2. College of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

Abstract  

The dynamic sampling dual deformable network (DSDDN) was proposed to improve the inference speed of video instance segmentation by making better use of the temporal information contained in video frames. A dynamic sampling strategy was employed, which adjusted the sampling policy according to the similarity between consecutive frames. For frames with high similarity, inference on the current frame was skipped, and the segmentation results of the preceding frame were reused through a lightweight transfer computation. For frames with low similarity, frames spanning a larger temporal range were dynamically aggregated to enrich the information of the current frame. Two deformable operations were additionally incorporated into the Transformer structure to avoid the prohibitive computational cost of attention-based methods. The resulting complex network was optimized with carefully designed tracking heads and loss functions. The proposed method achieves an inference accuracy of 39.1% mAP and an inference speed of 40.2 frames per second on the YouTube-VIS dataset, validating that the approach strikes a favorable balance between accuracy and speed in real-time video segmentation tasks.
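As a concrete illustration of the dynamic sampling strategy described above, the following minimal PyTorch sketch shows a reuse-gate loop: a cheap similarity score between consecutive frames decides whether to transfer the previous frame's masks or to run full inference. The pooling-based similarity measure and all names (frame_similarity, segment_frame, transfer_masks) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the dynamic sampling (reuse gate) idea: skip inference
# when consecutive frames are similar enough, otherwise run the full network.
import torch
import torch.nn.functional as F

def frame_similarity(prev: torch.Tensor, cur: torch.Tensor) -> float:
    """Similarity in [0, 1] from mean absolute difference of downsampled frames."""
    prev_s = F.avg_pool2d(prev, kernel_size=8)
    cur_s = F.avg_pool2d(cur, kernel_size=8)
    return 1.0 - (prev_s - cur_s).abs().mean().item()

def process_video(frames, segment_frame, transfer_masks, tau: float = 0.8):
    """Run full inference only when similarity to the previous frame drops below tau."""
    results, prev_frame, prev_masks = [], None, None
    for frame in frames:  # each frame: (C, H, W) tensor in [0, 1]
        if prev_frame is not None and frame_similarity(prev_frame, frame) >= tau:
            masks = transfer_masks(prev_masks, prev_frame, frame)  # cheap transfer
        else:
            masks = segment_frame(frame)  # full segmentation network
        results.append(masks)
        prev_frame, prev_masks = frame, masks
    return results
```

Raising tau forces more full-inference frames (higher accuracy, lower speed); the trade-off is quantified in Tab.5 below.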



Key words: video; online inference; instance segmentation; dynamic network; dual deformable network
Received: 27 June 2023      Published: 23 January 2024
CLC:  TP 391  
Fund:  Shanghai Science and Technology Commission (21511101200); National Natural Science Foundation of China (72192821); Shanghai Sailing Program (22YF1420300); CCF-Tencent Open Research Fund (RAGR20220121); Young Elite Scientists Sponsorship Program by CAST (2022QNRC001); National Natural Science Foundation of China (62302297)
Corresponding Author: Lizhuang MA     E-mail: songyiran@sjtu.edu.cn; lzma@sjtu.edu.cn
Cite this article:

Yiran SONG,Qianyu ZHOU,Zhiwen SHAO,Ran YI,Lizhuang MA. Dynamic sampling dual deformable network for online video instance segmentation. Journal of ZheJiang University (Engineering Science), 2024, 58(2): 247-256.

URL: https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.02.003 or https://www.zjujournals.com/eng/Y2024/V58/I2/247


Fig.1 Framework of DSDDN
Fig.2 Feature analysis of YouTube-VIS 2019 dataset
Fig.3 Visual results of frames using baseline method and dynamic sampling dual deformable network
Fig.4 Framework of dual deformable Transformer
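The dual deformable Transformer of Fig.4 builds on deformable attention in the spirit of Deformable DETR [13]: each query attends to a small, learned set of sampled points around a reference point rather than to every spatial position, so the cost grows linearly with the number of queries. The sketch below illustrates only that sampling mechanism; the module name, shapes, and single-scale setting are simplifying assumptions, not the paper's code.

```python
# Minimal single-scale deformable attention sketch (after Deformable DETR [13]).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.offsets = nn.Linear(dim, 2 * n_points)  # per-query sampling offsets
        self.weights = nn.Linear(dim, n_points)      # per-point attention weights
        self.proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, query, ref_points, feat):
        # query: (N, Lq, C); ref_points: (N, Lq, 2) in [-1, 1]; feat: (N, C, H, W)
        N, Lq, C = query.shape
        # offsets are predicted in normalized [-1, 1] coordinates
        offsets = self.offsets(query).view(N, Lq, self.n_points, 2)
        weights = self.weights(query).softmax(-1)               # (N, Lq, P)
        locs = (ref_points[:, :, None, :] + offsets).clamp(-1, 1)
        sampled = F.grid_sample(feat, locs, align_corners=False)  # (N, C, Lq, P)
        out = (sampled * weights[:, None, :, :]).sum(-1)        # (N, C, Lq)
        return self.proj(out.transpose(1, 2))                   # (N, Lq, C)
```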
Method | mAP/% | AP50/% | AP75/%
MaskTrack R-CNN 50 [8] | 30.3 | 51.1 | 32.6
MaskTrack R-CNN 101 [8] | 41.8 | 53.0 | 33.6
MaskProp 50 [10] | 40.0 | — | 42.9
MaskProp 101 [10] | 42.5 | — | 45.6
*VisTR 50 [11] | 36.2 | 59.8 | 36.9
*VisTR 101 [11] | 40.1 | 64.0 | 45.0
CrossVIS 50 [3] | 36.3 | 56.8 | 38.9
CrossVIS 101 [3] | 36.6 | 57.3 | 39.7
CompFeat 50 [31] | 35.3 | 56.0 | 38.6
*IFC 50 [22] | 41.0 | 62.1 | 45.4
STC [32] | 36.7 | 57.2 | 38.6
VSTAM [33] | 39.0 | 62.9 | 41.8
SipMask 50 [2] | 33.7 | 54.1 | 35.8
DSDDN 50 | 37.5 | 59.1 | 41.9
DSDDN 101 | 39.1 | 60.7 | 43.5
Tab.1 Comparisons of video instance segmentation on YouTube-VIS 2019 validation dataset
Method | Type | v/(frame·s⁻¹) | mAP/%
MaskTrack R-CNN [8] | online | 32.8 | 30.3
CrossVIS [3] | online | 39.8 | 34.8
VisTR [11] | offline | 51.1 | 36.2
CompFeat [31] | online | 32.8 | 35.3
SipMask [2] | online | 35.5 | 33.7
STEm-Seg [35] | near online | 4.40 | 34.6
DSDDN | online | 40.2 | 37.5
Tab.2 Efficiency comparisons on YouTube-VIS 2019 validation set
Method | mAP/% | AP50/% | AP75/%
MaskTrack-RCNN [9] | 28.6 | 48.9 | 29.6
SipMask [2] | 31.7 | 52.5 | 34.0
CrossVIS [3] | 34.2 | 54.4 | 37.9
IFC [22] | 36.6 | 57.9 | 39.3
DSDDN | 34.8 | 55.9 | 37.4
Tab.3 Accuracy comparison based on YouTube-VIS 2021 validation set
DSO | DDT | Output head | v/(frame·s⁻¹) | t_tr/h | mAP/%
 |  |  | 31.2 | 5000 | 36.5
 |  |  | 43.1 | 5100 | 35.1
 |  |  | 41.5 | 1050 | 36.7
 |  |  | 40.2 | 1100 | 37.5
Tab.4 Ablation experiment on DSO and DDT
τ | v/(frame·s⁻¹) | mAP/% | AP50/% | AP75/%
1.0 | 29.7 | 38.7 | 59.9 | 43.2
0.8 | 40.1 | 37.5 | 60.1 | 43.7
0.6 | 52.4 | 35.1 | 56.3 | 39.2
Tab.5 Ablation experiment results of threshold τ settings in reuse gate function
s | mAP/% | s | mAP/%
1 | 31.3 | 10 | 37.3
5 | 36.7 | 15 | 37.9
Tab.6 Ablation experiment results of sampling stride based on YouTube-VIS 2019 set
Method | mAP/% | v/(frame·s⁻¹)
Copy | 35.3 | 47.1
Displacement map | 38.2 | 36.6
Hybrid | 37.4 | 40.3
Tab.7 Influence of different transfer methods on mAP and inference speed
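Tab.7 compares transfer modes for skipped frames. Assuming "copy" means reusing the previous masks unchanged and "displacement map" means warping them with a per-pixel offset field (flow-like), a minimal sketch of the two modes could look as follows; both functions are hypothetical illustrations, not the paper's implementation.

```python
# Sketch of the two transfer modes compared in Tab.7 (assumed semantics).
import torch
import torch.nn.functional as F

def transfer_copy(prev_masks: torch.Tensor) -> torch.Tensor:
    """Copy: reuse the previous frame's masks unchanged."""
    return prev_masks.clone()

def transfer_warp(prev_masks: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Warp masks (N, 1, H, W) by a displacement map (N, 2, H, W) in pixels."""
    N, _, H, W = prev_masks.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()       # (H, W, 2), (x, y) order
    grid = base + disp.permute(0, 2, 3, 1)             # add per-pixel offsets
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0  # normalize x to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0  # normalize y to [-1, 1]
    return F.grid_sample(prev_masks, grid, align_corners=True)
```

The hybrid row of Tab.7 would then correspond to choosing between the two modes per frame, trading the accuracy of warping against the speed of copying.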
Number of layers | mAP_DTE/% | mAP_CATE/% | mAP_DTD/%
1 | 34.7 | 36.5 | 35.2
2 | 36.1 | 36.9 | 37.7
3 | 36.5 | 37.3 | 37.1
4 | 36.6 | 37.7 | 37.3
5 | 36.3 | 35.9 | 37.5
6 | 36.6 | 35.4 | 37.4
Tab.8 Ablation study on number of layers in Transformer blocks
Detection confidence | IoU | Category consistency | mAP/%
 |  |  | 35.7
 |  |  | 36.6
 |  |  | 36.1
 |  |  | 37.4
Tab.9 Influence of using different cues on track head
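Tab.9 ablates the cues used by the track head. One plausible way to combine detection confidence, mask IoU with existing instances, and category consistency into a single association score is sketched below; the linear weighting, dictionary layout, and function names are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical association score combining the three cues ablated in Tab.9.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean mask arrays of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def association_scores(dets, tracks, w_conf=1.0, w_iou=1.0, w_cls=1.0):
    """Score matrix S[i, j] between detection i and existing track j."""
    S = np.zeros((len(dets), len(tracks)))
    for i, d in enumerate(dets):
        for j, t in enumerate(tracks):
            S[i, j] = (w_conf * d["score"]                         # detection confidence
                       + w_iou * mask_iou(d["mask"], t["mask"])    # IoU with track
                       + w_cls * float(d["label"] == t["label"]))  # category consistency
    return S
```

A greedy argmax or Hungarian assignment over S would then link detections to tracks, with unmatched detections spawning new instances.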
[1]   YANG L, FAN Y, XU N. Video instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 5188-5197.
[2]   CAO J, WU X, SHEN C. Sipmask: spatial information preservation for fast image and video instance segmentation [C] // European Conference on Computer Vision. Glasgow: Springer, 2020.
[3]   YANG S, ZHOU L, HUANG Q. Crossover learning for fast online video instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. [S. l.]: IEEE, 2021: 8043-8052.
[4]   LIU D, HUANG Y, YU J. SG-Net: spatial granularity network for one-stage video instance segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 9816-9825.
[5]   HE K, GAURAV G, ROSS G. Mask R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
[6]   BOLYA D, WANG C, JIA Y. Yolact: real-time instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 9157-9166.
[7]   TIAN Z, SHEN C, CHEN H. Conditional convolutions for instance segmentation [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 282–298.
[8]   CHEN H, ZHANG X, YUAN L. BlendMask: top-down meets bottom-up for instance segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 8573-8581.
[9]   BERTASIUS G, TORRESANI L. Classifying, segmenting, and tracking object instances in video with mask propagation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 9739-9748.
[10]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C] // Advances in Neural Information Processing Systems. Long Beach: Curran Associates, 2017: 5998-6008.
[11]   WANG Y, FAN Y, XU N. End-to-end video instance segmentation with transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 8741-8750.
[12]   CARION N, TOUVRON H, VEDALDI A. End-to-end object detection with transformers [C] // European Conference on Computer Vision. Cham: Springer, 2020: 213-229.
[13]   ZHU X, ZHOU D, YANG D, et al. Deformable DETR: deformable Transformers for end-to-end object detection [C] // International Conference on Learning Representations. [S. l.: s. n.], 2021.
[14]   PARK H, KIM S, LEE J. Learning dynamic network using a reuse gate function in semi-supervised video object segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 8405-8414.
[15]   HE L, XIE W, YANG W. End-to-end video object detection with spatial-temporal Transformers [C] // Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 1507-1516.
[16]   HAN Y, LIU Z, YANG M. Dynamic neural networks: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(11): 7436-7456.
[17]   GLOROT X, BENGIO Y. Understanding the difficulty of training deep feedforward neural networks [C] // Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Belgirate: Springer, 2010: 249-256.
[18]   LI X, ZHANG Y, CHEN W. Improving video instance segmentation via temporal pyramid routing [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(5): 6594-6601.
[19]   LI Y, LIU J, XU M. Learning dynamic routing for semantic segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 8553-8562.
[20]   SUN P, KUNDU J, YUAN Y. TransTrack: multiple-object tracking with Transformer [EB/OL]. [2023-06-01]. https://doi.org/10.48550/arXiv.2012.15460.
[21]   MEINHARDT T, TEICHMANN M, CIPOLLA R. Trackformer: multi-object tracking with transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 8844-8854.
[22]   HWANG S, LIM S, YOON S. Video instance segmentation using inter-frame communication Transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 13352-13363.
[23]   DAI J, HE K, SUN J. Deformable convolutional networks [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii: IEEE, 2017: 764-773.
[24]   MILLETARI F, NAVAB N, AHMADI S. V-Net: fully convolutional neural networks for volumetric medical image segmentation [C] // 4th International Conference on 3D Vision. Stanford University: IEEE, 2016: 565-571.
[25]   STEWART R, ANDRILUKA M, NG A Y. End-to-end people detection in crowded scenes [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 2325-2333.
[26]   LIN T, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection [C] // Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980-2988.
[27]   WANG H, CHEN K, WANG K. Max-DeepLab: end-to-end panoptic segmentation with mask transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 5463-5474.
[28]   IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift [C] // International Conference on Machine Learning. Lille: Springer, 2015: 448-456.
[29]   LIN T, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C] // European Conference on Computer Vision. Zurich: Springer, 2014: 740-755.
[30]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[31]   FU Y, ZHANG Y, XU Y. Compfeat: comprehensive feature aggregation for video instance segmentation [C] // Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI, 2021, 35(2): 1361-1369.
[32]   JIANG Z, GU Z, PENG J, et al. STC: spatio-temporal contrastive learning for video instance segmentation [C] // European Conference on Computer Vision. Cham: Springer, 2022: 539-556.
[33]   FUJITAKE M, SUGIMOTO A. Video sparse Transformer with attention-guided memory for video object detection [J]. IEEE Access, 2022, 10: 65886-65900. doi: 10.1109/ACCESS.2022.3184031.
[34]   WU Y, KIRILLOV A, MASSA F, et al. Detectron2 [EB/OL]. (2019) [2023-06-01]. https://github.com/facebookresearch/detectron2.