Fig.1 Framework of DSDDN

Suppose the initial video clip has $ s $ frames with resolution $ {H}_{0}\times {W}_{0} $ and $ C $ channels, denoted by $\{{\boldsymbol{x}}_{i}\}_{i=1}^{s}\in {\bf{R}}^{s\times C\times {H}_{0}\times {W}_{0}}$. A CNN backbone processes the input frames to speed up inference. The clip is encoded as a set of low-resolution features $\{{\boldsymbol{f}}_{N_i}\}_{i=1}^{s}\in {\bf{R}}^{s\times C\times H\times W}$, where $H={H}_{0}/{N}_{i}$ and $W={W}_{0}/{N}_{i}$.

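In the experiments, YouTube-VIS 2019^[1] is used as the main dataset. It contains about 90 thousand images in 40 categories, drawn from nearly 3 thousand different videos. Fig.2(a) shows the IoU of the same instance computed between adjacent frames, where $ N_{\rm{f}} $ is the number of frames. The larger the IoU, the more static the instance. Fig.2(b) shows the proportion of videos with different frame counts in the VIS dataset. The analysis in Fig.2 reflects how severe the overlap caused by slow instance motion and fixed cameras is. The frame contents of these videos overlap heavily, and about half of the videos have 30 to 39 frames. These statistics support adding a dynamic sampling module in front of the Transformer architecture. When the current frame is highly redundant with the previous frame, the sampling module of DSDDN skips inference on the current frame and applies a lightweight displacement computation to the previous frame's result instead. This operation greatly reduces the per-frame computational cost and gives the model a better balance between efficiency and accuracy.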

Fig.2 Feature analysis of YouTube-VIS 2019 dataset
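
As a rough illustration of the statistic plotted in Fig.2(a), the IoU between the masks of one instance in two adjacent frames can be computed as below. This is a minimal sketch, not the authors' analysis script: the function name `mask_iou`, the nested-list mask layout and the convention for two empty masks are assumptions made for the example.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Assumption: two empty masks are treated as identical (IoU = 1).
    return inter / union if union else 1.0


# Hypothetical masks of the same instance in frames t-1 and t.
prev_mask = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
curr_mask = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
iou = mask_iou(prev_mask, curr_mask)  # 3 shared pixels out of 4 covered pixels
```

A value close to 1 means the instance barely moved between the two frames, which is the situation in which reusing the previous result is attractive.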

2.2. Dynamic sampling optimizer

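The goal of this work is to reduce repeated computation on highly similar frames in the VIS task. The dynamic sampling optimizer (DSO) is proposed to address this problem. Fig.3 shows the difference map between the feature maps of 2 consecutive frames, ${\boldsymbol{D}}_{t}\in {\bf{R}}^{C\times H\times W}$, whose resolution is the same as that of $ {\boldsymbol{f}}_{N_t} $. $\boldsymbol{D}_{t}$ is forwarded to the reuse gate function, which decides whether to skip this frame. Similar to Ref. [13], the gate function consists of 2 convolution layers, 2 max-pooling layers and 1 sigmoid function.

Fig.3 Visual results of frames using baseline method and dynamic sampling dual deformable network

The probability of the reuse gate is

(1)$ {P_\text{gate}} = \sigma \left( {\omega _2} {M_{\text{p}2}}\left({\omega _1} {M_{\text{p}1}}\left({\boldsymbol{f}_{N_t}}\right)\right) \right) . $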

where $ \omega $ and ${M}_\text{p}$ are the convolution weights and the max-pooling operation of a layer, respectively. The gate function is

(2)$ {G_t} = \left\{ \begin{gathered} 1\ (\text{reuse}),\ {P}_\text{gate} \gt \tau ; \\ 0\ (\text{no reuse}),\ \text{otherwise}. \\ \end{gathered} \right. $

where $ \tau $ is the threshold that controls the reuse gate.
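
The thresholding in Eq. (2) can be sketched as follows. This is only an illustrative sketch: `gate_logit` is a hypothetical scalar standing in for the output of the 2 convolution and 2 max-pooling layers before the sigmoid, and the default `tau=0.8` is the default gate threshold quoted in the ablation section.

```python
import math


def sigmoid(x):
    """Logistic function used as the last step of the gate."""
    return 1.0 / (1.0 + math.exp(-x))


def reuse_gate(gate_logit, tau=0.8):
    """Return (G_t, P_gate) following Eq. (2).

    gate_logit is a hypothetical scalar: the pooled response of the
    conv/max-pool stack before the sigmoid.
    """
    p_gate = sigmoid(gate_logit)
    g_t = 1 if p_gate > tau else 0  # 1: reuse the previous frame, 0: no reuse
    return g_t, p_gate


g_similar, p_similar = reuse_gate(3.0)      # strong "similar" response
g_different, p_different = reuse_gate(0.0)  # undecided response, P_gate = 0.5
```

With these inputs the first frame would be skipped and the second would go through full inference, because only the first probability exceeds the threshold.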

If the reuse gate is open, DSO computes an offset map $\boldsymbol{\varDelta }_{t}$, which contains the pixel-level transformation between the 2 consecutive frames. The segmentation of the current frame is inferred by adding $\boldsymbol{\varDelta }_{t}$ to the score map of the previous frame, so that the previous result is reused.

If the gate is closed, the current frame differs considerably from the previous frame (low similarity). For such hard samples with fast motion, richer information is drawn from auxiliary frames. For the current time $ t $, a reference span $ {s}_{t} $ is defined: inference on the current frame ${\boldsymbol{f}}_{t}$ uses the frames $\left[{\boldsymbol{f}}_{t-{s}_{t}}, {\boldsymbol{f}}_{t}\right]$ for feature aggregation. The reference stride predictor in DSO adjusts $ {s}_{t} $ dynamically according to the similarity score, which is similar to the IoU score that measures the motion speed of an instance. The predictor computes $ {s}_{t} $ as follows:

(3)$ {s_t} = \left\{ \begin{gathered} {s_\text{fast}},\ \text{similarity score} \lt {\tau _1} ; \\ {s_\text{middle}},\ {\tau _1} \lt \text{similarity score} \lt {\tau _2} ; \\ {s_\text{slow}},\ \text{otherwise}. \\ \end{gathered} \right. $

By default, $ {s}_\text{slow} = 3 $, $ {s}_\text{middle} = 5 $ and $ {s}_\text{fast} = 10 $.
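
Eq. (3) amounts to a three-way lookup, sketched below. The stride defaults are the ones stated above; the threshold values `tau1=0.3` and `tau2=0.7` are placeholders chosen for the example, since the text does not give them, and treating a score exactly equal to `tau1` as the middle case is likewise an assumption.

```python
def reference_stride(similarity, tau1=0.3, tau2=0.7,
                     s_fast=10, s_middle=5, s_slow=3):
    """Pick the reference span s_t from a similarity score, as in Eq. (3).

    tau1 and tau2 are hypothetical values; the strides are the stated defaults.
    """
    if similarity < tau1:
        return s_fast      # low similarity: fast motion, use the longest span
    if similarity < tau2:
        return s_middle    # intermediate similarity
    return s_slow          # high similarity: slow motion, shortest span


strides = [reference_stride(sim) for sim in (0.1, 0.5, 0.9)]
```

The design intent is that fast-moving (low-similarity) frames aggregate features over more reference frames, while near-static frames need only a short span.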

Regarding model size, the feature map ${\boldsymbol{f}}_{{N}_{t}}$ in the DSO part is kept large to ensure the accuracy of the instance matching module.

2.3. Dual deformable Transformer

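Deformable DETR is chosen as the detector base, but only its last stage is used, without the multi-scale feature representation. The DDT module (see Fig.4) consists of 2 Transformer architectures: a spatial-level Transformer (a deformable feature-enhancement Transformer encoder and a spatial Transformer decoder) and a temporal-level Transformer (a cross-attention Transformer encoder, a deformable Transformer encoder and a deformable Transformer decoder). In Fig.4, the left dashed box marks the spatial-level Transformer and the right one marks the temporal-level Transformer. The spatial and temporal deformable operations are decoupled and can run in parallel, which speeds up the whole model. Because 2 deformable operations, spatial and temporal, are used to optimize the structure, the block is called the dual deformable Transformer.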

Fig.4 Framework of dual deformable Transformer

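In the spatial-level Transformer, the deformable feature-enhancement Transformer encoder (DFETE) adds an offset map similar to those in Refs. [15, 23]. It enhances and compresses the feature map $ {\boldsymbol{f}}_{{N}_{t}} $ obtained from the DSO module into ${\boldsymbol{f}}_{32_{t}}$, and encodes ${\left\{{\boldsymbol{f}}_{32_{t}}\right\}}_{t=1}^{s}$ into the feature encodings ${\left\{{\boldsymbol{E}}_{i}\right\}}_{i=1}^{s}$ (dense dashed arrows in Fig.4). ${\left\{{\boldsymbol{E}}_{i}\right\}}_{i=1}^{s}$ then passes through the spatial Transformer decoder (STD) to obtain the object queries ${\left\{{\boldsymbol{Q}}_{i}\right\}}_{i=1}^{s}$ (sparse dashed arrows in Fig.4). Using an offset map similar to deformable convolution in the encoder module reduces the influence of background noise on the Transformer detector and improves the accuracy of spatial information extraction.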

In the temporal-level Transformer, the goal is to link the spatial details of each frame. Given the differences between the feature encodings ${\left\{\boldsymbol{E}_{i}\right\}}_{i=1}^{s}$ and the object queries ${\left\{\boldsymbol{Q}_{i}\right\}}_{i=1}^{s}$, 2 different encoders are used to encode them separately. The deformable Transformer encoder (DTE) encodes ${\left\{{\boldsymbol{E}}_{i}\right\}}_{i=1}^{s}$ and provides location cues for the current frame. DTE replaces the full attention layer with a temporal deformable attention layer^[14], which attends only to a small set of key sampling points around the reference point. In this way, the feature encodings of several frames can be connected efficiently. Let $ q $ index a query element with content feature $\boldsymbol{z}_{q}$ and 2-D reference point $\hat{\boldsymbol{p}}_{q}$. The multi-head deformable attention in DTE is formulated as follows:

(4)$ \begin{split} &\text{TDAttn}\left( {\boldsymbol{z}_q}, {\hat{\boldsymbol{p}}_q}, \left\{ {\boldsymbol{x}_i} \right\}_{i = 1}^s \right) = \\&\sum\limits_{m = 1}^M {\boldsymbol{W}_m}\left[ \sum\limits_{i = 1}^s \sum\limits_{k = 1}^K {A_{mqki}}{\boldsymbol{W}^{\rm{T}}_m}{\boldsymbol{x}_i}\left( {\phi _i}\left( {\hat{\boldsymbol{p}}_q} \right)+\Delta {\boldsymbol{p}}_{mqki} \right) \right] .\end{split} $

where $ m $ indexes the attention heads ($ M $ heads in total), the subscript $ i $ denotes the $ i $-th input frame, $ k $ indexes the sampled keys, and $ K $ is the total number of sampled keys ($ K \ll HW $). $\Delta {\boldsymbol{p}}_{mqki}$ and ${A}_{mqki}$ are the sampling offset and the attention weight of the $ k $-th sampling point in the $ i $-th feature map for the $ m $-th attention head, and $\boldsymbol{W}_{m}$ is the weight matrix of the attention layer. In addition, $\hat{\boldsymbol{p}}_{q}\in {\left[0, 1\right]}^{2}$ and ${\displaystyle \sum }_{i=1}^{s}{\displaystyle \sum }_{k=1}^{K}{A}_{mqki}=1$, i.e., both quantities are normalized. The function ${\phi }_{i}\left(\hat{\boldsymbol{p}}_{q}\right)$ rescales $\hat{\boldsymbol{p}}_{q}$ to the $ i $-th feature map.
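
To make the sampling pattern of Eq. (4) concrete, the following is a heavily simplified, single-head sketch on scalar feature maps. It is not the authors' implementation: the projection matrices $\boldsymbol{W}_m$ are omitted, the attention weights are obtained by a softmax over hypothetical `logits`, border clamping is assumed, and $\phi_i$ reduces to scaling normalized coordinates to pixel coordinates. All function and argument names are made up for the example.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample an H x W nested list at normalized coordinates (x, y) in [0, 1]."""
    h, w = len(feature), len(feature[0])
    px = min(max(x, 0.0), 1.0) * (w - 1)  # phi_i: rescale to this feature map
    py = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


def temporal_deformable_attention(frames, ref_point, offsets, logits):
    """Single-head weighted sum over s frames and K sampled keys per frame.

    frames:  list of s feature maps (nested lists)
    offsets: offsets[i][k] = (dx, dy), the sampling offset for frame i, key k
    logits:  logits[i][k], unnormalized attention scores
    """
    peak = max(v for per_frame in logits for v in per_frame)
    exps = [[math.exp(v - peak) for v in per_frame] for per_frame in logits]
    total = sum(sum(per_frame) for per_frame in exps)
    out = 0.0
    for i, feature in enumerate(frames):
        for k, (dx, dy) in enumerate(offsets[i]):
            weight = exps[i][k] / total  # weights sum to 1 over all i and k
            out += weight * bilinear_sample(feature, ref_point[0] + dx,
                                            ref_point[1] + dy)
    return out


frames = [[[1.0, 1.0], [1.0, 1.0]],
          [[3.0, 3.0], [3.0, 3.0]]]
offsets = [[(0.0, 0.0), (0.1, -0.1)],
           [(0.0, 0.0), (-0.2, 0.2)]]
logits = [[0.0, 0.0], [0.0, 0.0]]
out = temporal_deformable_attention(frames, (0.5, 0.5), offsets, logits)
```

Only $s \times K$ locations are read per query instead of all $s \times H \times W$ positions, which is where the saving over full attention comes from.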

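The cross-attention Transformer encoder (CATE) takes the spatial queries as input and learns the temporal context across frames. Following a coarse-to-fine aggregation strategy for spatial object queries similar to Ref. [14], the $ k $ most confident queries are selected and fed to the cross-attention layer. The outputs of CATE and DTE are both sent to the deformable Transformer decoder (DTD), which produces the result of the current frame as the final output.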

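2.4. Instance matching and loss function

The instance matching gap between the ground truth and the prediction is used to optimize DSO and DDT. The objective can be formally described as

(5)$ {\varDelta _\text{match}} = \text{match}\left( { \cdot , \;\cdot } \right) . $

where $\text{match}\left(\cdot ,\cdot \right)$ is a function that measures the similarity between its 2 arguments.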

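In DSO, the difference map ${\boldsymbol{D}}_{t}\in {\bf{R}}^{C\times H\times W}$ between the feature maps of 2 consecutive frames is generated, with the same resolution as ${\boldsymbol{f}}_{N_t}$. ${\boldsymbol{D}}_{t}$ is forwarded to the reuse gate function to decide whether to skip the frame. If the reuse gate is open, DSO generates from $\boldsymbol{D}_{t}$ an offset map ${\boldsymbol{\varDelta }}_{t}$ that represents the pixel-level change between the 2 frames. The segmentation of the current frame is obtained from ${\boldsymbol{\varDelta }}_{t}$ and the previous score map $ {S}_{t-1} $. The goal is to reduce the dissimilarity between the ground-truth mask $ {y}_{t} $ of the current frame and the previous score map $ {S}_{t-1} $.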

The loss of $ \varDelta $ is

(6)$ {L_\varDelta } = {\lambda _1}\left(1 - \text{DICE}\left({\boldsymbol{\varDelta} _t}, {y_{N_t}} - {S_{t - 1}}\right)\right) . $

where $ {y}_{{N}_{t}} $ is the rescaled ground-truth mask, whose resolution is the same as that of the previous score map $ {S}_{t-1} $; $ \text{DICE} $ denotes the DICE coefficient^[24].
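
A minimal sketch of Eq. (6) on flattened maps is given below. It uses one common form of the DICE coefficient (twice the overlap divided by the sum of both maps, with a small `eps` for stability); the exact variant used by the authors, and the value of $\lambda_1$, are not stated, so both are assumptions here.

```python
def dice_coefficient(pred, target, eps=1e-6):
    """One common DICE form on flattened maps: 2 * overlap / (|pred| + |target|)."""
    overlap = sum(p * t for p, t in zip(pred, target))
    return (2.0 * overlap + eps) / (sum(pred) + sum(target) + eps)


def offset_loss(delta_t, y_rescaled, prev_score, lambda1=1.0):
    """L_Delta = lambda1 * (1 - DICE(Delta_t, y_{N_t} - S_{t-1})), as in Eq. (6).

    lambda1 = 1.0 is a placeholder weight, not a value from the paper.
    """
    residual = [y - s for y, s in zip(y_rescaled, prev_score)]
    return lambda1 * (1.0 - dice_coefficient(delta_t, residual))


# Hypothetical flattened 2x2 maps.
y_rescaled = [1.0, 1.0, 0.0, 1.0]   # rescaled ground-truth mask of frame t
prev_score = [1.0, 0.0, 0.0, 0.0]   # score map of frame t-1
perfect = offset_loss([0.0, 1.0, 0.0, 1.0], y_rescaled, prev_score)
useless = offset_loss([0.0, 0.0, 0.0, 0.0], y_rescaled, prev_score)
```

The loss vanishes when the offset map reproduces exactly what changed between the previous score map and the current ground truth, and approaches $\lambda_1$ when the offset map explains none of the change.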

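The network is expected to achieve high segmentation accuracy while minimizing the consumption of computational resources. To this end, Zhu et al.^[13] introduced a variable $ m $ ranging from 0 to $ {m}_{1} $ to increase the IoU between adjacent frames. The parameters in Ref. [13] are set manually, which is unstable and may cause a sharp drop in performance. Here a $ \max $ function is used to control the scale of ${P}_\text{gate}$, where $\text{IoU}_{t}^{t-1}$ is the IoU between the current frame and the previous frame. The new gate probability loss is computed as follows:

(7)$ {P_\text{target}} = \text{IoU}_t^{t - 1} , $

(8)$ {L_\text{gate}} = \left(\max \left\{ {\lambda _2}{P_\text{gate}},\ \left|{P_\text{gate}}-{P_\text{target}}\right| \right\} \right)^2. $

where ${P}_\text{target}$ is the target gate probability.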
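
Eqs. (7) and (8) can be evaluated directly, as in the sketch below. The weight `lambda2=0.1` is a placeholder, since the text does not report the value used.

```python
def gate_loss(p_gate, iou_prev_curr, lambda2=0.1):
    """L_gate = (max{lambda2 * P_gate, |P_gate - P_target|})^2 with P_target = IoU.

    lambda2 = 0.1 is a hypothetical weight chosen for the example.
    """
    p_target = iou_prev_curr                      # Eq. (7)
    return max(lambda2 * p_gate, abs(p_gate - p_target)) ** 2  # Eq. (8)


overconfident = gate_loss(p_gate=0.9, iou_prev_curr=0.5)  # gap term dominates
calibrated = gate_loss(p_gate=0.8, iou_prev_curr=0.8)     # scale term dominates
```

When the gate probability matches the measured IoU, only the `lambda2 * p_gate` term remains, which is how the $\max$ keeps the scale of $P_\text{gate}$ under control.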

The overall DSO loss is

(9)$ {L_\text{DSO}} = {L_\varDelta }+{L_\text{gate}} . $

Next, consider the loss function of the dual deformable Transformer. The model generates a fixed-size set of class-wise masks

$ {\left\{\hat{\boldsymbol{y}}_{i}\right\}}_{i=1}^{{N}_{q}}={\left\{\left({\boldsymbol{c}}_i', \hat{\boldsymbol{m}}_{i}\right)\right\}}_{i=1}^{{N}_{q}} . $

The ground-truth set is denoted by ${\boldsymbol{y}}_{i}=\left({\boldsymbol{c}}_{i}, {\boldsymbol{m}}_{i}\right)$, where $ {\boldsymbol{c}}_{i} $ is the class label, which may be the empty class, $ {\boldsymbol{c}}_{i}' $ is the predicted class, and $\boldsymbol{m}_{i}$ is the target mask; this set is downsampled to improve the efficiency of instance matching. Bipartite matching between ${\left\{\hat{\boldsymbol{y}}_{i}\right\}}_{i=1}^{{N}_{q}}$ and ${\left\{{\boldsymbol{y}}_{i}\right\}}_{i=1}^{K}$ is carried out by

(10)$ \hat{\sigma } = \mathop {\arg \max }\limits_{\sigma \in {M_{N_q}}} \sum\limits_{i = 1}^K \text{match}\left( {\boldsymbol{y}_i}, {\hat{\boldsymbol{y}}_{\sigma \left( i \right)}} \right) . $

where $M_{N_q} $ is the set of all permutations of $1,2, \cdots , N_q $. Following Refs. [11, 12, 22, 25], the Hungarian algorithm is used to compute the bipartite matching. The loss of DDT is

(11)$\begin{split} {L_\text{pos}} =& {L_\text{CEL}}+{L_\text{DL}}+{L_\text{SFL}} = \\& \sum\limits_{i = 1}^K \left[ - \ln {\hat P}_{\hat{\sigma } \left( i \right)}({\boldsymbol{c}}_i)+{\lambda _3}\left( 1 - \text{DICE}\left( {\boldsymbol{m}_i}, {\hat{\boldsymbol{m}}_{\hat{\sigma } \left( i \right)}} \right) \right) + \right. \\ &\left.{\lambda _4}\text{FOCAL}\left( {\boldsymbol{m}_i}, {\hat{\boldsymbol{m}}_{\hat{\sigma } \left( i \right)}} \right) \right] , \end{split}$

(12)$ {L_{\rm{neg}}} = \sum\limits_{i = K+1}^{N_q} \left[ - \ln {\hat P}_{\hat{\sigma } \left( i \right)}\left( \varnothing \right) \right] . $

where $ {L}_\text{pos} $ improves the classification and segmentation accuracy for non-empty classes, and $ {L}_\text{neg} $ optimizes the prediction of the $ \varnothing $ class; ${L}_\text{CEL}$, ${L}_\text{DL}$ and ${L}_\text{SFL}$ are the cross-entropy loss, the DICE loss^[24] and the sigmoid-focal loss^[26] computed between ${\left\{\hat{\boldsymbol{y}}_{i}\right\}}_{i=1}^{{N}_{q}}$ and $ {\boldsymbol{y}}_{i} $, respectively; ${\hat P}_{\hat{\sigma } \left( i \right)}({\boldsymbol{c}}_i) $ is the probability of class ${\boldsymbol{c}}_i $ for the prediction with index $\hat{\sigma } \left( i \right) $, and ${\hat P}_{\hat{\sigma } \left( i \right)}\left( \varnothing \right)$ is the probability of class $\varnothing $ for the prediction with index $\hat{\sigma } \left( i \right) $.

Inspired by Ref. [22], a mask-based measure is chosen instead of the box-based measure used in DETR to achieve better accuracy.
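
The assignment in Eq. (10) can be illustrated on a tiny example. The sketch below is a brute-force stand-in for the Hungarian algorithm: it enumerates every assignment of $K$ ground-truth instances to $N_q$ predictions, which is only practical for very small sets but returns the same optimum. The similarity values are invented for the example and would come from the mask-based match function in practice.

```python
from itertools import permutations


def best_assignment(similarity):
    """Exhaustive search for the sigma that maximizes the summed match score.

    similarity[i][j] is the match score between ground truth i and prediction j,
    with K = len(similarity) <= N_q = len(similarity[0]).
    """
    k, n_q = len(similarity), len(similarity[0])
    best_score, best_sigma = float("-inf"), None
    for sigma in permutations(range(n_q), k):
        score = sum(similarity[i][sigma[i]] for i in range(k))
        if score > best_score:
            best_score, best_sigma = score, sigma
    return best_sigma, best_score


# Hypothetical scores for K = 2 ground-truth instances and N_q = 3 predictions.
similarity = [[0.9, 0.1, 0.3],
              [0.8, 0.2, 0.7]]
sigma_hat, total = best_assignment(similarity)
```

Here ground truth 0 is matched to prediction 0 and ground truth 1 to prediction 2; the remaining prediction is unmatched and would contribute to $L_\text{neg}$.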

2.5. Output heads

3 output heads are placed on top of DTD to handle the VIS task: 1 classification head, 1 mask head and 1 tracking head. The classification head predicts the classes of the $ {N}_{q} $ queries, ${\left\{{\boldsymbol{c}}_{i}'\right\}}_{i=1}^{{N}_{q}}\in {\bf{R}}^{{N}_{q}\times \left|C\right|}$, with 2 fully connected layers and 1 softmax function. Another 2 fully connected layers reuse the decoder output to predict the mask features ${\boldsymbol{M}}_{f}\in {\bf{R}}^{{N}_{q}\times D}$. A normalized feature ${\boldsymbol{f}}_{N_t}\in {\bf{R}}^{D\times H'\times W'}$ is produced, where $H'={H}_{0}/{N}_{t}$ and $W'={W}_{0}/{N}_{t}$. The mask prediction $\hat{\boldsymbol{m}}$ can be described as the product of ${\boldsymbol{M}}_{f}$ and ${\boldsymbol{f}}_{N_t}$:

(13)$ \hat{\boldsymbol{m}} = \text{softmax}_{N_q}\left( {\boldsymbol{M}_f} \cdot {\boldsymbol{f}_{N_t}} \right) \in {\bf{R}}^{{N_q} \times H' \times W'} .$

Following the setting of Ref. [27], batch normalization^[28] is applied to ${\boldsymbol{M}}_{f}$ and $\left({\boldsymbol{M}}_{f}\cdot {\boldsymbol{f}}_{N_t}\right)$ to avoid hand-crafted initialization.

A carefully designed tracking head is added to track instances across video frames. For the current frame $ t $, suppose there is a detected candidate instance $\hat{\boldsymbol{y}}_{i}$, and the instances ${\left\{{\boldsymbol{I}}_{i}\right\}}_{i=1}^{N}$ already identified in the reference frames (given by the mask labels ${\left\{{\boldsymbol{y}}_{i}\right\}}_{i=1}^{N}$ during training) serve as targets. $\hat{\boldsymbol{y}}_{i}$ is either assigned one of the labels in ${\left\{{\boldsymbol{I}}_{i}\right\}}_{i=1}^{N}$ or given a new label.

The instance embedding of $\hat{\boldsymbol{y}}_{i}$ is used to compute the probability of assigning it to each label. Similar to Ref. [3], a set of learnable instance weights $[{\boldsymbol{\omega} }_{1},{\boldsymbol{\omega} }_{2},\cdots ,{\boldsymbol{\omega} }_{n}]$ is adopted to handle fluctuation during training, where $ n $ is the number of categories in the dataset.

To handle large-scale datasets (such as the YouTube-VIS 2019 training set), focal loss is used to optimize the tracking head. The probability of assigning label $ n $ to $\hat{\boldsymbol{y}}_{i}$ is defined as

(14)$ {P_i}\left( n \right) = \frac{\exp \left( {\boldsymbol{e}}_i^{\rm{T}}{\boldsymbol{\omega}}_n \right)}{\displaystyle \sum\limits_{j = 1}^I \exp \left({\boldsymbol{e}}_i^{\rm{T}}{\boldsymbol{\omega}}_j\right)} . $

where ${\boldsymbol{e}}_i $ is the embedding of each instance in the $ i $-th frame. The loss function of the tracking head is as follows.

(15)$ {P_i}\left( n \right) = \left\{ \begin{gathered} \text{sigmoid}\left({\boldsymbol{e}}_i^{\rm{T}}{\boldsymbol{\omega}}_n\right),\ \hat{\boldsymbol{y}}_i \in \text{label}_n ; \\ 1 - \text{sigmoid}\left({\boldsymbol{e}}_i^{\rm{T}}{\boldsymbol{\omega}}_n\right),\ \text{otherwise}. \\ \end{gathered} \right. $

(16)$ {L_{\rm{track}}} = - {\alpha _t}{\left( 1 - {P_i}\left( n \right) \right)^\gamma }\ln {P_i}\left( n \right) . $

where $ {\alpha }_{t} $ and $ \gamma $ follow the definitions in Ref. [26] and are set to $ {\alpha }_{t} = 0.25 $ and $ \gamma = 2 $.
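
Eqs. (15) and (16) can be written for a single embedding-weight pair as below. This is a sketch under the stated settings $\alpha_t = 0.25$ and $\gamma = 2$; `logit` stands for the inner product $\boldsymbol{e}_i^{\rm{T}}\boldsymbol{\omega}_n$, and the function name is made up for the example.

```python
import math


def track_focal_loss(logit, is_positive, alpha_t=0.25, gamma=2.0):
    """Focal loss of Eq. (16) for one (instance, label) pair.

    logit is the inner product e_i^T w_n; is_positive says whether the
    instance actually carries label n, which selects the branch of Eq. (15).
    """
    s = 1.0 / (1.0 + math.exp(-logit))
    p = s if is_positive else 1.0 - s            # Eq. (15)
    return -alpha_t * (1.0 - p) ** gamma * math.log(p)


uncertain = track_focal_loss(0.0, is_positive=True)   # p = 0.5
confident = track_focal_loss(4.0, is_positive=True)   # p close to 1
```

The $(1 - P)^\gamma$ factor shrinks the loss of already well-classified pairs, so training concentrates on hard associations.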

From Eqs. (9), (11), (12) and (16), the final loss is the weighted sum of ${L}_{\rm{DSO}}$, $ {L}_{\rm{pos}} $, $ {L}_{\rm{neg}} $ and $ {L}_{\rm{track}} $:

(17)$ L = {\lambda _{\rm{DSO}}}{L_{\rm{DSO}}}+{\lambda _{\rm{pos}}}{L_{\rm{pos}}}+ {\lambda _{\rm{neg}}}{L_{\rm{neg}}}+{\lambda _{\rm{track}}}{L_{\rm{track}}} . $

where each $ \lambda $ controls the influence of the corresponding part on the optimization of the whole model.

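3. Experimental analysis

3.1. Experimental details

3.1.1. Datasets and training settings

The proposed method DSDDN is evaluated on 2 challenging video instance segmentation benchmarks, YouTube-VIS 2019 and YouTube-VIS 2021^[2]. The 2019 version contains 2883 videos in 40 categories, and the 2021 version contains more than 3800 videos.

The original Deformable DETR^[12] is used as the code base. Unless otherwise stated, most hyperparameters follow IFC^[22]. The model is pretrained on the COCO^[29] dataset for 150 epochs with the AdamW optimizer; the initial learning rate is $ 2\times 1{0}^{-4} $ for the Transformer and $ 2\times 1{0}^{-5} $ for the backbone, and the weight decay is $ 1{0}^{-4} $. All Transformer parameters are initialized with Xavier initialization. All backbones are initialized with parameters pretrained on ImageNet, and their batchnorm layers are frozen. ResNet-50^[30] and ResNet-101^[30] are used as backbones and are distinguished by the model suffixes 50 and 101. After pretraining, the model is trained on the target dataset with a batch size of 16 and frames downsampled to 360×640. The number of training epochs on the target dataset depends on the backbone: with ResNet-50 the model is trained for 10 epochs and the learning rate is decayed from the 8th epoch; with ResNet-101 it is trained for 12 epochs and the learning rate is decayed from the 10th epoch.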

3.1.2. Real-time inference

During inference, the same scale setting as in training is used. The input video is downscaled to 360 pixels to reduce the computational cost. Only the top 10 instance predictions are kept, because most frames in the YouTube-VIS dataset contain no more than 10 instances. Inspired by Ref. [1], tracking instances across frames based only on Eq. (14) is found to be unstable. Additional cues are therefore added for real-time association. At test time, the score of assigning label $ n $ to $ \hat{\boldsymbol{y}}_{i} $ is computed as follows:

(18)$ {v}_{i}\left(n\right)= \left(\ln {P}_{i}\left(n\right)+\alpha\, \text{IoU}\left({\boldsymbol{m}}_{i},{\boldsymbol{m}}_{n}\right)\right) \delta \left({\boldsymbol{c}}_{i},{\boldsymbol{c}}_{n}\right) . $

where $ \text{IoU}\left({\boldsymbol{m}}_{i},{\boldsymbol{m}}_{n}\right) $ is the IoU between $ {\boldsymbol{m}}_{i} $ and $ {\boldsymbol{m}}_{n} $; $ \delta \left({\boldsymbol{c}}_{i},{\boldsymbol{c}}_{n}\right) $ is the Kronecker delta function, which equals 1 when $ {\boldsymbol{c}}_{i}={\boldsymbol{c}}_{n} $ and 0 otherwise; and $ \alpha $ adjusts the influence of the IoU.
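
Eq. (18) combines three cues into one score, which the following sketch evaluates literally. The value `alpha=1.0`, the clamp that avoids `log(0)` and the string category labels are assumptions introduced for the example.

```python
import math


def association_score(p_assign, mask_iou, cat_i, cat_n, alpha=1.0):
    """v_i(n) = (ln P_i(n) + alpha * IoU(m_i, m_n)) * delta(c_i, c_n), Eq. (18).

    alpha = 1.0 is a placeholder; the paper only says alpha tunes the IoU cue.
    """
    delta = 1.0 if cat_i == cat_n else 0.0        # Kronecker delta on categories
    log_p = math.log(max(p_assign, 1e-12))        # clamp added to avoid log(0)
    return (log_p + alpha * mask_iou) * delta


same_cat = association_score(0.5, 0.8, "cat", "cat")
other_cat = association_score(0.5, 0.8, "cat", "dog")
```

The category-consistency factor removes both the embedding cue and the IoU cue whenever the predicted categories disagree.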

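3.2. Result analysis

The validation results on YouTube-VIS 2019 are listed in Tab.1, where methods marked with * are offline models, which use more reference information and have significant inference latency. In the table, mAP (mean average precision) is the mean of the average precision, and AP50 and AP75 are the average precision at IoU thresholds of 50% and 75%, respectively. The proposed method is compared with existing state-of-the-art VIS methods using different backbones (ResNet-50 and ResNet-101). mAP is used for model comparison; it is a common metric for object detection, instance segmentation, panoptic segmentation and other computer vision tasks. mAP combines precision and recall and measures model performance by averaging the precision over different confidence thresholds. DSDDN samples video frames dynamically and adds dual deformable operations to the pipeline. Measured by AP, the proposed real-time method achieves competitive scores, which indicates that it can accurately detect categories, segment instance masks and track objects across frames. In particular, a method that takes clips rather than single frames as input must reduce the clip length T to approach real-time frame-by-frame inference. According to Ref. [22], T = 5 is set in IFC^[22] and T = 13 is set in MaskProp^[8], which introduces a small latency for real-time inference.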

Tab.1 Comparisons of video instance segmentation on YouTube-VIS 2019 validation dataset

Method	mAP/%	AP50/%	AP75/%
MaskTrack R-CNN 50^[8]	30.3	51.1	32.6
MaskTrack R-CNN 101^[8]	41.8	53.0	33.6
MaskProp 50^[10]	40.0	—	42.9
MaskProp 101^[10]	42.5	—	45.6
*VisTR 50^[11]	36.2	59.8	36.9
*VisTR 101^[11]	40.1	64.0	45.0
CrossVIS 50^[3]	36.3	56.8	38.9
CrossVIS 101^[3]	36.6	57.3	39.7
CompFeat 50^[31]	35.3	56.0	38.6
*IFC 50^[22]	41.0	62.1	45.4
STC^[32]	36.7	57.2	38.6
VSTAM^[33]	39.0	62.9	41.8
SipMask 50^[2]	33.7	54.1	35.8
DSDDN 50	37.5	59.1	41.9
DSDDN 101	39.1	60.7	43.5

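Following the same settings as in Tab.1, detectron2^[34] is used to measure the inference speed. Unless otherwise specified, all measured models use ResNet-50 as the CNN backbone, and frames are downsampled to 360 pixels. As shown in Tab.2, DSDDN with ResNet-50 reaches an inference speed v of 40.2 frame/s and an mAP of 37.5%. As an offline model, VisTR is run with T = 36, which significantly increases its inference speed. Compared with the other real-time and near-real-time methods, DSDDN achieves better performance.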

Tab.2 Efficiency comparisons on YouTube-VIS-2019 validation set

Method	Type	v/(frame·s⁻¹)	mAP/%
MaskTrack R-CNN^[8]	online	32.8	30.3
CrossVIS^[3]	online	39.8	34.8
VisTR^[11]	offline	51.1	36.2
CompFeat^[31]	online	32.8	35.3
SipMask^[2]	online	35.5	33.7
STEm-Seg^[35]	near online	4.40	34.6
DSDDN	online	40.2	37.5

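YouTube-VIS 2021 is an extended version of YouTube-VIS 2019 that refines the video categories and enlarges the number of samples. The official implementation is used for evaluation. The results in Tab.3 show that the proposed method outperforms most state-of-the-art methods.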

Tab.3 Accuracy comparison based on YouTube-VIS 2021 validation set

Method	mAP/%	AP50/%	AP75/%
MaskTrack-RCNN^[9]	28.6	48.9	29.6
SipMask^[2]	31.7	52.5	34.0
CrossVIS^[3]	34.2	54.4	37.9
IFC^[22]	36.6	57.9	39.3
DSDDN	34.8	55.9	37.4

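3.3. Ablation study

Experiments are conducted on the YouTube-VIS 2019 validation set with ResNet-50 as the backbone.

Tab.4 evaluates the influence of the components of the proposed method, including DSO, DDT and the carefully designed output heads, where $ t_{\rm{tr}} $ is the training time. Specifically, DSO is designed to increase the real-time inference speed, which may reduce the prediction accuracy, while the optimization of the Transformer makes training faster. To reflect the influence of each component more comprehensively, 2 key metrics are considered: inference speed and training time.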

Tab.4 Ablation experiment on DSO and DDT

DSO	DDT	Output head	v/(frame·s⁻¹)	t_tr/h	mAP/%
—	—	—	31.2	5000	36.5
√	—	—	43.1	5100	35.1
√	√	—	41.5	1050	36.7
√	√	√	40.2	1100	37.5

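3.3.1. Settings of the dynamic sampling optimizer

In DSO, several manually set parameters significantly affect accuracy and inference speed. Ablation experiments are conducted to study the effect of the sampling stride $ s $ and the threshold $ \tau $ of the reuse gate function. Based on the feature analysis of the dataset in Fig.2, $ \tau $ is set to 0.6, 0.8 and 1.0. Tab.5 shows the trade-off between accuracy and computational cost.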

Tab.5 Ablation experiment results of threshold $ \tau $ sets in reuse gate function

$ \tau $	v/(frame·s⁻¹)	mAP/%	AP50/%	AP75/%
1.0	29.7	38.7	59.9	43.2
0.8	40.1	37.5	60.1	43.7
0.6	52.4	35.1	56.3	39.2

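With the gate function fixed, Tab.6 lists the mAP for different values of $ s $. As the span increases, mAP gradually rises to a maximum and then remains relatively stable, so the value that best balances mAP and computational cost is chosen. 3 strategies are considered for handling the current frame when the reuse gate is open: 1) copying the result of the previous frame; 2) generating the segmentation by a simple computation with the offset map; 3) a hybrid of direct copying and offset-map generation. In the hybrid strategy, an extra threshold $ {\tau }_{2} $ is set, which is larger than the default gate threshold $ \tau = 0.8 $. When the probability is greater than $ {\tau }_{2} $, DSO copies the result of the previous frame; when the probability is greater than $ \tau $ but less than $ {\tau }_{2} $, DSO uses the offset map to compute the result of the current frame. Tab.7 shows that the proposed method minimizes the accuracy loss caused by skipping the full inference.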
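
The hybrid strategy described above is a two-threshold decision, sketched below. The default `tau=0.8` is the stated gate threshold; `tau2=0.95` is a placeholder, since the text only requires $\tau_2 > \tau$, and the returned strings are hypothetical labels for the three branches.

```python
def choose_strategy(p_gate, tau=0.8, tau2=0.95):
    """Decide how to process the current frame from the gate probability.

    tau2 = 0.95 is a hypothetical value; the text only states tau2 > tau.
    """
    if p_gate > tau2:
        return "copy"            # reuse the previous result unchanged
    if p_gate > tau:
        return "offset_map"      # shift the previous result with the offset map
    return "full_inference"      # gate closed: run the whole network


decisions = [choose_strategy(p) for p in (0.99, 0.9, 0.5)]
```

Copying is the cheapest branch and the offset map the more accurate one, so the second threshold lets near-identical frames take the cheap path while moderately similar frames still get a correction.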

Tab.6 Ablation experiment results of sampling stride based on YouTube-VIS 2019 set

s	mAP/%	s	mAP/%
1	31.3	10	37.3
5	36.7	15	37.9

Tab.7 Influence of using different methods on mAP and inference speed

Method	mAP/%	v/(frame·s⁻¹)
Copy	35.3	47.1
Offset map	38.2	36.6
Hybrid	37.4	40.3

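3.3.2. Settings of the dual deformable Transformer and output heads

Tab.8 summarizes the mAP for different numbers of layers in the Transformer blocks (DTE, CATE and DTD). In the table, the subscripts DTE, CATE and DTD indicate which module has its number of layers modified. Considering that more layers raise both accuracy and computational cost, 3 encoder layers and 3 decoder layers are used.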

Tab.8 Ablation study on number of layers in transformer blocks

Layers	mAP_DTE/%	mAP_CATE/%	mAP_DTD/%
1	34.7	36.5	35.2
2	36.1	36.9	37.7
3	36.5	37.3	37.1
4	36.6	37.7	37.3
5	36.3	35.9	37.5
6	36.6	35.4	37.4

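Tab.9 presents an ablation experiment on the influence of different cues on model accuracy. Using the detection confidence together with IoU and category consistency clearly improves the overall accuracy.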

Tab.9 Influence of using different cues on track head

Detection confidence	IoU	Category consistency	mAP/%
√	—	—	35.7
√	√	—	36.6
√	—	√	36.1
√	√	√	37.4

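4. Conclusion

This paper proposes DSDDN, an attention-based model for the video instance segmentation task that consists of 2 core parts: the dynamic sampling optimizer and the dual deformable Transformer. DSDDN adds dual deformable operations to a DETR-based structure, which strengthens spatio-temporal information aggregation while avoiding exponential computational complexity and significantly speeds up training. Detailed experiments on benchmark datasets verify the effectiveness of DSDDN, which reaches an mAP of 39.5% on the public YouTube-VIS 2019 validation set. With a fast inference speed of 40.2 frame/s, the real-time method outperforms the baseline method.

Future work will change the model from frame-by-frame input and output to whole-video input and output, which would increase the reuse of spatio-temporal information and is expected to significantly improve the inference speed.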

References

[1]

YANG L, FAN Y, XU N. Video instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 5188-5197.

[2]

CAO J, WU X, SHEN C. Sipmask: spatial information preservation for fast image and video instance segmentation [C] // European Conference on Computer Vision. Glasgow: Springer, 2020.

[3]

YANG S, ZHOU L, HUANG Q. Crossover learning for fast online video instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. [S. l.]: IEEE, 2021: 8043-8052.

[4]

LIU D, HUANG Y, YU J. SG-Net: spatial granularity network for one-stage video instance segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 9816-9825.

[5]

HE K, GAURAV G, ROSS G. Mask R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.

[6]

BOLYA D, WANG C, JIA Y. Yolact: real-time instance segmentation [C] // Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 9157-9166.

[7]

TIAN Z, SHEN C, CHEN H. Conditional convolutions for instance segmentation [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 282–298.

[8]

CHEN H, ZHANG X, YUAN L. BlendMask: top-down meets bottom-up for instance segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 8573-8581.

[9]

BERTASIUS G, TORRESANI L. Classifying, segmenting, and tracking object instances in video with mask propagation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 9739-9748.

[10]

VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C] // Advances in Neural Information Processing Systems. Los Angeles: Curran Associates, 2017: 5998-6008.

[11]

WANG Y, FAN Y, XU N. End-to-end video instance segmentation with transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 8741-8750.

[12]

CARION N, TOUVRON H, VEDALDI A. End-to-end object detection with transformers [C] // European Conference on Computer Vision. Cham: Springer, 2020: 213-229.

[13]

ZHU X, ZHOU D, YANG D, et al. Deformable DETR: deformable Transformers for end-to-end object detection [C] // International Conference on Learning Representations. Addis Ababa: PMLR, 2020.

[14]

PARK H, KIM S, LEE J. Learning dynamic network using a reuse gate function in semi-supervised video object segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 8405-8414.

[15]

HE L, XIE W, YANG W. End-to-end video object detection with spatial-temporal Transformers [C] // Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 1507-1516.

[16]

HAN Y, LIU Z, YANG M. Dynamic neural networks: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(11): 7436-7456.

[17]

GLOROT X, BENGIO Y. Understanding the difficulty of training deep feedforward neural networks [C] // Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Belgirate: Springer, 2010: 249-256.

[18]

LI X, ZHANG Y, CHEN W. Improving video instance segmentation via temporal pyramid routing [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(5): 6594-6601.

[19]

LI Y, LIU J, XU M. Learning dynamic routing for semantic segmentation [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2020: 8553-8562.

[20]

SUN P, KUNDU J, YUAN Y. Transtrack: multiple-object tracking with Transformer [EB/OL]. [2023-06-01]. https://doi.org/10.48550/arXiv.2012.15460.

[21]

MEINHARDT T, TEICHMANN M, CIPOLLA R. Trackformer: multi-object tracking with transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 8844-8854.

[22]

HWANG S, LIM S, YOON S. Video instance segmentation using inter-frame communication Transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 13352-13363.

[23]

DAI J, HE K, SUN J. Deformable convolutional networks [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii: IEEE, 2017: 764-773.

[24]

MILLETARI F, NAVAB N, AHMADI S. V-Net: fully convolutional neural networks for volumetric medical image segmentation [C] // 4th International Conference on 3D Vision. Stanford University: IEEE, 2016: 565-571.

[25]

STEWART R, ANDRILOUKA M, NG A. Y. End-to-end people detection in crowded scenes [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 2325-2333.

[26]

LIN T, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection [C] // Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980-2988.

[27]

WANG H, CHEN K, WANG K. Max-DeepLab: end-to-end panoptic segmentation with mask transformers [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 5463-5474.

[28]

IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift [C] // International Conference on Machine Learning. Lille: Springer, 2015: 448-456.

[29]

LIN T, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C] // European Conference on Computer Vision. Stockholm: Springer, 2014: 740-755.

[30]

HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.

[31]

FU Y, ZHANG Y, XU Y. Compfeat: comprehensive feature aggregation for video instance segmentation [C] // Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI, 2021, 35(2): 1361-1369.

[32]

JIANG Z, GU Z, PENG J, et al. STC: spatio-temporal contrastive learning for video instance segmentation [C] // European Conference on Computer Vision. Cham: Springer, 2022: 539-556.

[33]

FUJITAKE M, SUGIMOTO A. Video sparse Transformer with attention-guided memory for video object detection [J]. IEEE Access, 2022, 10: 65886-65900. DOI: 10.1109/ACCESS.2022.3184031.

[34]

WU Y, BUAA K, SUN C. Detectron2 [EB/OL]. (2019) [2023-06-01]. https://github.com/facebookresearch/detectron2.

[35]

ATHAR A, BAAQUE M. A, KITTANEH S. Stem-seg: spatio-temporal embeddings for instance segmentation in videos [C] // European Conference on Computer Vision. Göttingen: Springer, 2020: 158-177.