Fig.1 Structure diagram of PCT-Adapter network

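2.1. Standard point cloud Transformer

Mainstream pre-trained standard point cloud Transformers, such as Point-Bert^[10], Point-MAE^[9], ACT^[7], and ReCon^[3], use different pre-training strategies to strengthen point cloud feature learning, yet they share the same point cloud Transformer backbone and a unified pre-trained parameter structure. The Transformer backbone in PCT-Adapter follows this generic structure, so it can directly load weights produced by a variety of pre-training strategies. The standard Transformer modules used by PCT-Adapter are described below.

2.1.1. Input sequence encoding module

Given the input point cloud $ {\boldsymbol{P}} \in {{\bf{R}}^{n \times 3}} $, the input sequence encoding applies farthest point sampling (FPS) to extract $ S $ representative center points $ {{\boldsymbol{P}}_S} \in {{\bf{R}}^{S \times 3}} $. With each sampled point as a center, the $ K $-nearest neighbors (KNN) algorithm gathers $ K $ neighboring points, which yields $ S $ local point patches $ {\boldsymbol{G}} \in {{\bf{R}}^{S \times K \times 3}} $. The corresponding center is subtracted from each of the $ S $ local patches to remove the global coordinate information. A multi-layer perceptron (MLP) followed by a max pooling layer aggregates the local point information and encodes it as the feature $ {\boldsymbol{T}} \in {{\bf{R}}^{S \times D}} $.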
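The sampling and grouping steps above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names `farthest_point_sampling` and `knn_group`, the choice of the first point as the FPS seed, and the toy coordinates are all assumptions made for the example.

```python
import math

def farthest_point_sampling(points, s, start=0):
    """Greedily pick s point indices so that each new pick is farthest from those already chosen."""
    n = len(points)
    chosen = [start]  # assumption: the first point seeds the sampling
    # min_dist[i] = distance from point i to its nearest already-chosen center
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < s:
        nxt = max(range(n), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return chosen

def knn_group(points, center_idx, k):
    """For each center, gather its k nearest points and subtract the center coordinates."""
    groups = []
    for c in center_idx:
        center = points[c]
        nearest = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))[:k]
        groups.append([[a - b for a, b in zip(points[i], center)] for i in nearest])
    return groups

# Toy cloud with three well-separated pairs of points (n = 6).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
          (1.0, 0.2, 0.0), (0.0, 1.0, 0.0), (0.1, 1.0, 0.0)]
centers = farthest_point_sampling(points, s=3)   # S = 3 center indices
patches = knn_group(points, centers, k=2)        # shape S x K x 3, center-relative
```

Each patch is expressed relative to its own center, so the first entry of every patch (the center itself) becomes the zero vector; the MLP and max pooling that follow in the paper then operate on these local coordinates.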

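Position encoding plays a key role in the Transformer architecture. Following the design of the standard Transformer^[18] used in the vision and point cloud fields, position encoding is added to the sequence features. An MLP composed of 2 linear layers and 1 GELU activation encodes the coordinates of the $ S $ center points into $ D $-dimensional vectors, forming the position embedding $ {{\boldsymbol{E}}_{{\text{pos}}}} \in {{\bf{R}}^{S \times D}} $. The point cloud encoding $ {\boldsymbol{T}} $ is added to the position encoding $ {{\boldsymbol{E}}_{{\text{pos}}}} $ to obtain the feature sequence $ {{\boldsymbol{X}}_{{\text{patch}}}} $.

In addition, a class token $ {{\boldsymbol{X}}_{{\text{cls}}}} \in {{\bf{R}}^{1 \times D}} $, composed of learnable parameters and a random position encoding, is added to the input sequence to learn the overall features without bias. The input sequence of the standard Transformer, $ {\boldsymbol{F}}_{{\text{st}}}^0 $, can be expressed as

$$ {\boldsymbol{F}}_{{\text{st}}}^0 = {{\boldsymbol{X}}_{{\text{patch}}}} \oplus {{\boldsymbol{X}}_{{\text{cls}}}}. \tag{1} $$

where $ \oplus $ is the feature concatenation operation and $ {\boldsymbol{F}}_{{\text{st}}}^0 \in {{\bf{R}}^{(1+S) \times D}} $.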

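2.1.2. Transformer backbone

To improve feature fusion, a bidirectional feature interaction (BFI) module is designed between the standard point cloud Transformer and the Adapter and is repeated $ N $ times. According to the number of bidirectional interactions, the backbone, which has $ L $ Transformer layers of uniform scale, is divided into $ N $ blocks. Assuming that the standard Transformer backbone samples $ S $ center points with FPS, that is, the point cloud is divided into $ S $ patches, the input of the $ i $-th Transformer block can be expressed as $ {\boldsymbol{F}}_{{\text{st}}}^{i - 1} \in {{\bf{R}}^{(1+S) \times D}} $.

In the $ i $-th BFI module, the Adapter-to-Transformer interaction (A-to-T) is performed first. The prior knowledge $ {\boldsymbol{F}}_{{\text{prior}}}^{i - 1} $ is injected into the input feature $ {\boldsymbol{F}}_{{\text{st}}}^{i - 1} $ of the $ i $-th Transformer block to supplement it, which yields the feature $ \hat {\boldsymbol{F}}_{{\text{st}}}^{i - 1} \in {{\bf{R}}^{(1+S) \times D}} $. The output of the $ i $-th Transformer block can be expressed as

$$ {\boldsymbol{F}}_{{\text{st}}}^i = {{\mathrm{Block}}} (\hat {\boldsymbol{F}}_{{\text{st}}}^{i - 1}+{\boldsymbol{F}}_{{\text{st}}}^{i - 1}). \tag{2} $$

where $ \mathrm{Block}(\cdot) $ is the $ i $-th block of the Transformer backbone.

Subsequently, the Transformer-to-Adapter interaction (T-to-A) maps the output feature $ {\boldsymbol{F}}_{{\text{st}}}^i $ from the Transformer to the Adapter, which updates and refines the prior knowledge $ {\boldsymbol{F}}_{{\text{prior}}}^{i - 1} $.

Following the structure of standard pre-trained Transformer models, PCT-Adapter sets $ L $ to 12, while $ S $ and the KNN parameter $ K $ are adjusted flexibly for each dense prediction task. Based on the effect of the number of fusions on model performance, $ N $ is set to 6.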
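The block partition and the order of the two interactions can be made concrete with a small scheduling sketch. It assumes that the $ L $ layers are split into $ N $ equal contiguous blocks, which the text does not state explicitly; `partition_layers` and `interaction_schedule` are hypothetical helper names.

```python
def partition_layers(num_layers, num_blocks):
    """Split layer indices 0..num_layers-1 into num_blocks equal contiguous blocks."""
    if num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes L is divisible by N")
    size = num_layers // num_blocks
    return [list(range(i * size, (i + 1) * size)) for i in range(num_blocks)]

def interaction_schedule(num_layers, num_blocks):
    """List the order of operations in one forward pass: A-to-T, block layers, T-to-A."""
    steps = []
    for i, layers in enumerate(partition_layers(num_layers, num_blocks), start=1):
        steps.append(f"A-to-T {i}")                     # inject prior features into the block input
        steps.extend(f"layer {l}" for l in layers)      # run the Transformer layers of block i
        steps.append(f"T-to-A {i}")                     # update the prior features from the block output
    return steps

blocks = partition_layers(12, 6)        # L = 12, N = 6 as set in the paper
schedule = interaction_schedule(12, 6)
```

With $ L = 12 $ and $ N = 6 $, each block under this assumption holds two Transformer layers, and every block is wrapped by one A-to-T and one T-to-A interaction.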

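2.2. Prior feature extraction (PFE) module

Compared with images, point clouds are unordered, sparse, and irregular, so conventional convolution can hardly capture their geometric prior features in full. A prior feature extraction (PFE) module is therefore designed to extract point cloud prior features flexibly. As shown in Fig.2, the PFE module hierarchically stacks several geometric set abstraction (GSA) modules to improve the perception of targets at different scales. The GSA module combines relative position encoding^[21] with a geometric transformation^[22] module to strengthen the capture of local geometric features.

Fig.2 Structure of prior feature extraction module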

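Specifically, for the input point cloud $ {{\boldsymbol{P}}^0} \in {{\bf{R}}^{n \times 3}} $, the PFE module uses an MLP to encode it into the feature $ {{\boldsymbol{F}}^0} \in {{\bf{R}}^{n \times C}} $, and then deploys several GSA modules hierarchically to extract multi-scale, fine-grained geometric features.

Take the $ j $-th GSA module as an example. FPS and KNN are applied to the input points $ {{\boldsymbol{P}}^{j - 1}} \in {{\bf{R}}^{{C^{j - 1}} \times 3}} $ and their features $ {{\boldsymbol{F}}^{j - 1}} \in {{\bf{R}}^{{C^{j - 1}} \times {D^{j - 1}}}} $ to perform local aggregation at scale $ {C^j} $. This process selects the center points $ {\boldsymbol{P}}_{{\text{fps}}}^j \in {{\bf{R}}^{{C^j} \times 3}} $, the neighboring points $ {\boldsymbol{P}}_{{\text{knn}}}^j \in {{\bf{R}}^{{C^j} \times K \times 3}} $, and their corresponding features $ {\boldsymbol{F}}_{{\text{fps}}}^j \in {{\bf{R}}^{{C^j} \times {D^{j - 1}}}} $ and $ {\boldsymbol{F}}_{{\text{knn}}}^j \in {{\bf{R}}^{{C^j} \times K \times {D^{j - 1}}}} $.

The GSA module incorporates relative position encoding to strengthen geometric awareness. Specifically, the coordinates of the neighboring points relative to their sampled center are computed within each local region, and the geometric information is refined with the mean and standard deviation to obtain $ \Delta {\boldsymbol{P}}_{{\text{knn}}}^j $. The neighborhood features are updated as follows:

$$ {\boldsymbol{F}}_{{\text{pos}}}^j = {\psi _{{\text{pos}}}}({\boldsymbol{F}}_{{\text{knn}}}^j,{{\mathrm{PosE}}} (\Delta {\boldsymbol{P}}_{{\text{knn}}}^j)). \tag{3} $$

where $ \text{PosE}(\cdot ) $ maps $ \Delta {\boldsymbol{P}}_{{\text{knn}}}^j $ to a $ {D^{j - 1}} $-dimensional vector using trigonometric functions, and $ {\psi }_{\text{pos}}(\cdot ) $ is the relative position encoding.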
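To illustrate what a trigonometric encoding of normalized relative coordinates can look like, the sketch below standardizes the offsets of one local patch and expands each coordinate into sine and cosine features. The exact normalization and the frequency schedule are not given in the text; the patch-wide mean and standard deviation, the constants `alpha` and `beta`, and the names `normalise_offsets` and `pos_e` are assumptions for this example only.

```python
import math

def normalise_offsets(neighbours, center, eps=1e-5):
    """Subtract the center, then standardise all offsets of the patch by their mean and std."""
    offsets = [[a - b for a, b in zip(p, center)] for p in neighbours]
    flat = [v for o in offsets for v in o]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    return [[(v - mean) / (std + eps) for v in o] for o in offsets]

def pos_e(offset, dim, alpha=100.0, beta=1000.0):
    """Map a 3-D offset to a dim-D vector of sine/cosine features (dim must be divisible by 6)."""
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6")
    freqs = dim // 6  # number of frequencies per coordinate axis
    out = []
    for coord in offset:
        for m in range(freqs):
            angle = alpha * coord / (beta ** (m / freqs))  # hypothetical frequency schedule
            out.extend((math.sin(angle), math.cos(angle)))
    return out

normed = normalise_offsets([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
                           (0.0, 0.0, 0.0))
embedding = pos_e([0.0, 0.0, 0.0], dim=12)   # zero offset gives alternating sin(0), cos(0)
```

The resulting vector has the same width as the neighborhood feature, so $ {\psi }_{\text{pos}}(\cdot ) $ can combine the two, for instance by addition or element-wise weighting; which combination the paper uses is not specified here.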

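To address the challenges posed by sparse regions and irregular geometric structures in real-world datasets^[23-24], a learnable geometric affine^[22] module is introduced to strengthen the geometric structure extraction of the GSA module. This module takes $ {\boldsymbol{F}}_{{\text{fps}}}^j $, $ {\boldsymbol{F}}_{{\text{pos}}}^j $, and their corresponding coordinates $ {\boldsymbol{P}}_{{\text{fps}}}^j $ and $ {\boldsymbol{P}}_{{\text{knn}}}^j $ as input, and produces the local geometric feature $ {{\boldsymbol{G}}^j} \in {{\bf{R}}^{{C^j} \times K \times {D^{j - 1}}}} $ of the point cloud at scale $ {C^j} $:

$$ {{\boldsymbol{G}}^j} = {\varPhi _{{\text{affine}}}}({\boldsymbol{F}}_{{\text{fps}}}^j,{\boldsymbol{F}}_{{\text{pos}}}^j,{\boldsymbol{P}}_{{\text{knn}}}^j,{\boldsymbol{P}}_{{\text{fps}}}^j). \tag{4} $$

where $ {\varPhi }_{\text{affine}}(\cdot ) $ is the geometric affine module. The sampled center features are fused with $ {{\boldsymbol{G}}^j} $, and the output of the $ j $-th GSA module can be expressed as

$$ {\boldsymbol{F}}^j = {{\boldsymbol{P}}_{{\text{max}}}}({\mathrm{MLP}}({{\boldsymbol{G}}^j} \oplus {\boldsymbol{F}}_{{\text{fps}}}^j)). \tag{5} $$

where $ {{\boldsymbol{P}}_{{\text{max}}}} $ is the max pooling layer and $ \oplus $ is the feature concatenation operation.

A linear layer aligns the output features of each GSA module with the input features of the Transformer to facilitate the interaction between the Adapter and the Transformer. The features of different scales from the GSA modules are concatenated to form a multi-scale feature pyramid.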

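2.3. Bidirectional feature interaction (BFI) module

The BFI module consists of $ N $ interaction blocks, each of which interacts with one Transformer block. As shown in Fig.3, every interaction block contains 2 interactions in opposite directions: Adapter to Transformer (A-to-T) and Transformer to Adapter (T-to-A).

Fig.3 Structure of bidirectional feature interaction module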

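2.3.1. A-to-T module

The A-to-T module injects multi-scale prior features into the Transformer without altering the standard Transformer block, which strengthens the feature learning of the standard Transformer. It uses cross-attention^[25] to inject multi-scale geometric information. In different dense prediction tasks, the prior features at different scales of the feature pyramid differ in importance. A set of learnable parameters is therefore introduced in the A-to-T module to adaptively adjust the fusion ratio of the features at each scale, which improves the adaptability and segmentation performance of the model across tasks and scenes.

Taking S3DIS^[24] as an example, if the sampling rates of the PFE module are set to $ C = \left\{ {n/4,n/16,n/64,n/256} \right\} $, the $ i $-th A-to-T module uses the multi-scale prior knowledge $ {\boldsymbol{F}}_{{\text{Prior}}}^{i - 1} \in {{\bf{R}}^{\left( {n/4+n/16+n/64+n/256} \right) \times D}} $ as the keys and values of the cross-attention, and the input feature of the Transformer block $ {\boldsymbol{F}}_{{\text{st}}}^{i - 1} \in {{\bf{R}}^{(1+S) \times D}} $ as the queries. The injection of prior knowledge is expressed as follows:

$$ \hat {\boldsymbol{F}}_{{\text{st}}}^{i - 1} = {{\boldsymbol{\gamma}} ^i} \odot {{\mathrm{MCA}}} ({{\mathrm{LN}}} ({\boldsymbol{F}}_{{\text{st}}}^{i - 1}),{{\mathrm{LN}}} ({\boldsymbol{F}}_{{\text{Prior}}}^{i - 1}))+{\boldsymbol{F}}_{{\text{st}}}^{i - 1}. \tag{6} $$

where $ \text{MCA}(\cdot ) $ is the cross-attention, which contains several vanilla cross-attention operations corresponding to the levels of the feature pyramid and outputs backbone features fused across scales; $ {{\boldsymbol{\gamma}} ^i} \in {{\bf{R}}^{4 \times D}} $ is the combination of 4 learnable parameters; $ \odot $ is element-wise multiplication; $ \mathrm{LN}(\cdot ) $ is layer normalization; and the feature fused with prior knowledge, $ \hat {\boldsymbol{F}}_{{\text{st}}}^{i - 1} \in {{\bf{R}}^{(1+S) \times D}} $, is the input of the $ i $-th Transformer block.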
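The injection in Eq. (6) can be illustrated with a stripped-down sketch. It makes several simplifying assumptions that are not from the paper: single-head attention without learned query, key, or value projections, one attention pass per pyramid scale whose output is weighted by that scale's row of $ {\boldsymbol{\gamma}} $ and summed, and two scales instead of four to keep the toy data short. The names `layer_norm`, `cross_attention`, and `a_to_t` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention; keys and values are the same tokens."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        top = max(scores)                                   # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) / total
                    for j in range(d)])
    return out

def a_to_t(f_st, prior_pyramid, gamma):
    """f_st: (1+S) x D tokens; prior_pyramid: one token list per scale; gamma: one D-vector per scale."""
    q = [layer_norm(t) for t in f_st]
    injected = [list(t) for t in f_st]                      # residual term F_st
    for scale_tokens, g in zip(prior_pyramid, gamma):
        kv = [layer_norm(t) for t in scale_tokens]
        attended = cross_attention(q, kv)
        for row, att in zip(injected, attended):
            for j in range(len(row)):
                row[j] += g[j] * att[j]                     # gamma scales the injected prior
    return injected

f_st = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]        # 2 tokens, D = 4
pyramid = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],    # scale 1: 2 tokens
           [[0.0, 0.0, 1.0, 1.0]]]                          # scale 2: 1 token
gamma_zero = [[0.0] * 4, [0.0] * 4]
gamma_small = [[0.1] * 4, [0.1] * 4]
unchanged = a_to_t(f_st, pyramid, gamma_zero)
updated = a_to_t(f_st, pyramid, gamma_small)
```

With all entries of $ {\boldsymbol{\gamma}} $ at zero, the output equals the input, which shows how the learnable scale controls the share of prior knowledge that reaches the Transformer block while the residual path keeps the pre-trained features intact.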

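2.3.2. T-to-A module

The T-to-A module receives the output features of the Transformer block and uses them to update the multi-scale point cloud pyramid features. It supplements the pyramid, which focuses on local geometric information, with global features, so that more comprehensive point cloud information is extracted.

The T-to-A module uses cross-attention to update the features at each scale of the feature pyramid. In addition, it adopts a structurally simple, memory-light MLP with shared parameters to learn common features, which promotes information propagation across scales.

The cross-attention module takes $ {\boldsymbol{F}}_{{\mathrm{st}}}^i $ as the keys and values and $ {\boldsymbol{F}}_{{\text{Prior}}}^{i - 1} $ as the queries. Supplementing the multi-scale prior knowledge with global features is expressed as follows:

$$ \hat {\boldsymbol{F}}_{{\text{Prior}}}^{i - 1} = {{\mathrm{MCA}}} ({{\mathrm{LN}}} ({\boldsymbol{F}}_{{\text{st}}}^i),{{\mathrm{LN}}} ({\boldsymbol{F}}_{{\text{Prior}}}^{i - 1}))+{\boldsymbol{F}}_{{\text{Prior}}}^{i - 1}. \tag{7} $$

where $ \text{MCA}(\cdot ) $ is the cross-attention, which contains several vanilla cross-attention operations corresponding to the levels of the feature pyramid and injects global features into the prior features at each scale; $ \mathrm{LN}(\cdot ) $ is layer normalization.

An MLP with shared weights performs cross-scale information interaction and fusion on the prior features of different scales:

$$ {\boldsymbol{F}}_{{\text{Prior}}}^i = {{{\mathrm{MLP}}} _{{\text{share}}}}(\hat {\boldsymbol{F}}_{{\text{Prior}}}^{i - 1}). \tag{8} $$

where $ {\mathrm{MLP}}_{\text{share}}(\cdot ) $ is a multi-layer perceptron with shared weights, composed of 2 linear layers and 1 ReLU activation. The MLP is reused to update the features at every scale, and all updates share the same weights. The updated multi-scale feature pyramid has a more robust feature representation and provides the standard Transformer with more accurate point cloud structure information.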

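3. Experiments and result analysis

3.1. Datasets and experimental settings

3.1.1. ShapeNetPart

The ShapeNetPart^[11] dataset contains 16 881 objects in 16 categories with 50 part labels. Each object consists of 2 to 6 parts.

For the model settings, Point-Bert^[10] is strictly followed. 2 048 points are randomly selected as the input of each object, the number of sampled centers $ S $ of the standard Transformer is set to 128, and the KNN parameter $ K $ is set to 32. The number of GSA modules is set to 3 with down-sampling rates $ C = \left\{ n/8, n/16, n/32 \right\} $, and the KNN parameter is the same as in the backbone.

For the training settings, because the label distribution of ShapeNetPart is fairly balanced, PCT-Adapter is trained with the cross-entropy loss. Training runs on a server with an NVIDIA 4090 GPU (24 GB memory) using the AdamW optimizer, with a learning rate of 0.000 5 and a batch size of 6, for 300 epochs.

For the evaluation metrics, the class-average intersection over union mIoU_cls (class-wise intersection over union) and the instance-average intersection over union mIoU_ins (instance-wise intersection over union) are used. mIoU_cls measures the average segmentation performance over all categories and reflects how well the model generalizes across categories. mIoU_ins focuses on the segmentation quality of each individual instance, especially when handling individual differences and complex part structures.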
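The difference between the two averages can be seen in a short sketch. It follows the commonly used part-segmentation protocol, in which a part that is absent from both the prediction and the ground truth of a shape counts as IoU 1; that convention, the function names, and the toy labels are assumptions here rather than details taken from the paper.

```python
def shape_iou(pred, target, part_ids):
    """Mean IoU over the parts that belong to this shape's category."""
    ious = []
    for part in part_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, target) if p == part or t == part)
        ious.append(1.0 if union == 0 else inter / union)  # absent part counts as perfect
    return sum(ious) / len(ious)

def miou_ins_and_cls(samples, parts_of_category):
    """samples: list of (category, predicted labels, ground-truth labels) per shape."""
    per_category = {}
    all_ious = []
    for category, pred, target in samples:
        iou = shape_iou(pred, target, parts_of_category[category])
        per_category.setdefault(category, []).append(iou)
        all_ious.append(iou)
    miou_ins = sum(all_ious) / len(all_ious)                  # average over shapes
    miou_cls = sum(sum(v) / len(v) for v in per_category.values()) / len(per_category)
    return miou_ins, miou_cls                                 # second value averages over categories

parts = {"chair": [0, 1], "lamp": [2, 3]}
samples = [
    ("chair", [0, 0, 1, 1], [0, 0, 1, 1]),   # perfect prediction
    ("chair", [0, 0, 1, 1], [0, 0, 1, 1]),   # perfect prediction
    ("lamp",  [2, 2, 2, 3], [2, 2, 3, 3]),   # one point mislabelled
]
miou_ins, miou_cls = miou_ins_and_cls(samples, parts)
```

In this toy case the frequent, easy category lifts mIoU_ins to about 0.861, while mIoU_cls, which gives each category equal weight, is about 0.792.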

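3.1.2. S3DIS

The S3DIS^[24] dataset covers 6 large indoor areas from 3 different buildings, with 273 million points in total, annotated with 13 categories (ceiling, floor, table, etc.). Area 5 is used for testing and the other areas are used for training.

For the model settings, because each point cloud scene is large, 12 000 points are randomly cropped as the input of each scene. The number of sampled centers $ S $ of the standard Transformer backbone is set to 256 and $ K $ is set to 32. The number of GSA modules is 4, with down-sampling rates $ C = \left\{ n/4, n/16, n/64, n/256 \right\} $ and $ K $ set to 32.

For the training settings, a weighted cross-entropy loss is adopted to handle class imbalance. The weight of each class is computed from the class distribution of the input point clouds. Training runs on an NVIDIA 4090 with 24 GB memory using the AdamW optimizer, with a learning rate of 0.000 5, a batch size of 6, and 250 epochs in total.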
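One simple way to derive class weights from the label distribution is inverse frequency, sketched below. The paper does not give its weighting formula, so the inverse-frequency rule, the normalization to a mean weight of 1 over the classes that occur, and the names `inverse_frequency_weights` and `weighted_cross_entropy` are assumptions for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by the inverse of its frequency, rescaled to mean 1 over present classes."""
    counts = Counter(labels)
    raw = [len(labels) / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    present = [w for w in raw if w > 0]
    mean = sum(present) / len(present)
    return [w / mean for w in raw]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of the applied weights."""
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

labels = [0, 0, 0, 1]                               # class 1 is three times rarer than class 0
weights = inverse_frequency_weights(labels, num_classes=2)
probs = [[0.5, 0.5]] * 4                            # an uninformative prediction for every point
loss = weighted_cross_entropy(probs, labels, weights)
```

Here the rare class receives three times the weight of the frequent one (1.5 versus 0.5), so errors on rare classes contribute more to the loss.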

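For the evaluation metrics, the overall accuracy (OA), mean class accuracy (mAcc), and mean intersection over union (mIoU) are used. OA evaluates the model over all data points. mAcc is obtained by computing the accuracy of each class and averaging, and it reflects the average performance of the model across classes. mIoU is obtained by computing, for each class, the ratio of the intersection to the union of the predictions and the ground-truth labels and averaging over all classes. mIoU directly reflects the spatial precision of the model and is the primary metric for the S3DIS segmentation task.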
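The three metrics can all be read off a confusion matrix, as the following sketch shows. Skipping classes that never occur in the ground truth (for mAcc) or in either prediction or ground truth (for mIoU) is an assumption of this example; `segmentation_metrics` is a hypothetical name.

```python
def segmentation_metrics(pred, target, num_classes):
    """Return (OA, mAcc, mIoU) computed from per-point predicted and ground-truth labels."""
    conf = [[0] * num_classes for _ in range(num_classes)]   # rows: ground truth, columns: prediction
    for p, t in zip(pred, target):
        conf[t][p] += 1
    total = len(target)
    correct = sum(conf[c][c] for c in range(num_classes))
    accs, ious = [], []
    for c in range(num_classes):
        gt = sum(conf[c])                                    # points whose true class is c
        pr = sum(conf[r][c] for r in range(num_classes))     # points predicted as c
        union = gt + pr - conf[c][c]
        if gt > 0:
            accs.append(conf[c][c] / gt)
        if union > 0:
            ious.append(conf[c][c] / union)
    return correct / total, sum(accs) / len(accs), sum(ious) / len(ious)

target = [0, 0, 0, 0, 1, 1]
pred   = [0, 0, 0, 1, 1, 1]      # one class-0 point is mistaken for class 1
oa, macc, miou = segmentation_metrics(pred, target, num_classes=2)
```

For this toy input OA is 5/6, mAcc is 0.875, and mIoU is 17/24; mIoU is the lowest because the single error is penalized in the union of both classes.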

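3.1.3. SemanticKITTI

SemanticKITTI^[23] is a dataset of real-world road scenes that contains 22 sequences and 43 552 point cloud frames. Sequences 00 to 07, 09, and 10 (19 130 frames in total) are used as the training set, and sequence 08 (4 071 frames) is used as the validation set.

For the model settings, 20 000 points are randomly cropped from each scene as input. The KNN parameter $ K $ of the Transformer backbone and the GSA modules is 64, and the other PCT-Adapter parameters are the same as for S3DIS. For the training settings, a weighted cross-entropy loss and the AdamW optimizer are used. Training runs on an NVIDIA 4090 with a learning rate of 0.000 5, a batch size of 2, and 100 epochs in total.

For the evaluation metrics, SemanticKITTI exhibits pronounced class imbalance; for example, vehicles and buildings far outnumber pedestrians and bicycles. Following the experimental settings of previous work and the need of autonomous-driving datasets for accurate spatial prediction, mIoU is used as the evaluation metric. Evaluating each class separately and then averaging reduces the dominance of large classes over the result.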

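3.2. Experimental results on ShapeNetPart

As shown in Tab.1, both Point-Bert and PCT-Adapter load the Point-Bert pre-trained parameters and are tested on part segmentation on the ShapeNetPart dataset. With the Adapter structure added, PCT-Adapter improves both mIoU_cls and mIoU_ins by 0.4 percentage points. The PFE module uses relative position encoding and the geometric affine module to strengthen feature learning in complex local regions. Because ShapeNetPart is relatively small, its samples are structurally simple, and its parts differ little from one another, the PFE module does not achieve the desired effect, so the performance gain of PCT-Adapter on this dataset is limited.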

Tab.1 Part segmentation result of ShapeNetPart dataset

Model	mIoU_cls/%	mIoU_ins/%
PointNet++^[26]	81.9	85.1
PointASNL^[27]	—	86.1
PCT^[6]	—	86.4
PointTransformer^[5]	83.7	86.6
PointCAT^[28]	84.4	86.0
Point-Bert^[10]	84.1	85.6
PCT-Adapter	84.5	86.0

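PCT-Adapter effectively narrows the performance gap between the standard Transformer and variant Transformers^[6, 28]. This result indicates that PCT-Adapter helps to address the structural limitations of the standard Transformer. As point cloud pre-training methods and datasets develop and the feature extraction ability of pre-trained standard Transformers is exploited further, PCT-Adapter is expected to outperform variant Transformers.

3.3. Experimental results on S3DIS

As shown in Tab.2, both Point-Bert and PCT-Adapter load the Point-Bert pre-trained weights and are tested on semantic segmentation on the real-world indoor dataset S3DIS. Compared with the standard Transformer, PCT-Adapter improves mAcc by 4.8 percentage points and mIoU by 5.5 percentage points. PCT-Adapter also surpasses some variant Transformers^[6, 13-14, 28].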

Tab.2 Semantic segmentation result of S3DIS dataset (area 5); the per-class columns report IoU/%

Method	mAcc/%	mIoU/%	Ceiling	Floor	Wall	Beam	Column	Window	Door	Table	Chair	Sofa	Bookcase	Board	Clutter
SPG^[27]	66.5	58.0	89.4	96.9	78.1	0.0	42.8	48.9	61.6	84.7	75.4	69.8	52.6	2.1	52.2
PointWeb^[2]	66.6	60.3	92.0	98.5	79.4	0.0	21.1	59.7	34.8	76.3	88.3	46.9	69.3	64.9	52.5
PAT^[29]	70.8	60.1	93.0	98.5	72.3	1.0	41.5	85.1	38.2	57.7	83.6	48.1	67.0	61.3	33.6
PT^[5]	76.5	70.4	94.0	98.5	86.3	0.0	38.0	63.4	74.3	89.1	82.4	74.3	80.2	76.0	59.3
PCT^[6]	67.7	61.3	92.5	98.4	80.6	0.0	19.3	61.6	48.0	76.6	85.2	46.2	67.7	67.9	52.3
PatchF^[13]	—	67.3	91.8	98.7	86.2	0.0	34.1	48.9	62.4	81.6	89.8	47.2	74.9	74.4	58.6
PointCAT^[28]	71.0	64.0	94.2	98.3	80.5	0.0	18.6	55.5	58.9	77.2	88.0	64.8	72.2	68.9	55.4
SPFormer^[14]	77.3	68.9	91.5	98.2	81.4	0.0	23.3	65.3	40.0	75.5	87.7	59.5	67.8	65.6	49.4
Point-Bert^[10]	75.7	63.5	91.3	92.3	73.1	0.0	33.9	65.6	60.4	76.5	82.7	86.8	64.0	41.7	43.0
PCT-Adapter	80.5	69.0	91.9	96.0	81.6	0.0	52.4	66.5	67.0	82.9	90.1	70.8	72.8	69.5	54.7

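Fig.4 shows the segmentation results of PCT-Adapter and Point-Bert on Area 5. PCT-Adapter maintains the overall segmentation quality while improving the segmentation of details. These results indicate that PCT-Adapter effectively extends the standard Transformer to downstream tasks and support the soundness of the design.

Fig.4 Segmentation visualization result on S3DIS (area 5)

3.4. Experimental results on SemanticKITTI

As shown in Tab.3, both Point-Bert and PCT-Adapter load the Point-Bert pre-trained parameters and are tested on semantic segmentation on the real-world outdoor dataset SemanticKITTI.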

Tab.3 Quantitative result of different methods on SemanticKITTI

Method	Input	mIoU/%
PointNet++^[26]	50 000 points	20.1
SPG^[27]	50 000 points	17.4
SPLATNet^[30]	50 000 points	18.4
TangentConv^[31]	50 000 points	40.9
SqueezeSegV2^[32]	64×2 048 pixels	39.7
DarkNet21Seg^[23]	64×2 048 pixels	47.4
DarkNet53Seg^[23]	64×2 048 pixels	49.9
Point-Bert^[10]	20 000 points	44.5
PCT-Adapter	20 000 points	53.4

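The results show that PCT-Adapter improves mIoU by 8.9 percentage points over the standard Transformer. The gain that PCT-Adapter brings to the standard Transformer in more complex and diverse real-world outdoor scenes is attributed to the way the Adapter complements the standard Transformer structure and to high-quality feature interaction, which supports the fine-grained feature capture and the effective auxiliary role of PCT-Adapter.

3.5. Ablation experiments

The effectiveness of each part of the Adapter is verified as follows. PCT-Adapter is split into the Adapter and the standard Transformer to verify the soundness of the design. The effect of the bidirectional interaction frequency on PCT-Adapter is analyzed quantitatively. The cross-model generality of the Adapter is examined with a variety of pre-training methods. All ablation experiments are conducted on the S3DIS dataset with the same hardware, optimizer, number of iterations, and batch size.

3.5.1. Ablation of Adapter components

To demonstrate the effectiveness of each Adapter module, PCT-Adapter is divided into the Transformer module, the PFE module, and the BFI module, and the BFI module is further divided into the A-to-T module and the T-to-A module. When only the PFE module is used, the prior knowledge is fused into the output features of the standard Transformer by vector addition.

Tab.4 shows that, compared with the standard point cloud Transformer, adding only the PFE module already improves mIoU by 2.0 percentage points. Introducing the BFI module brings a larger improvement, and the complete PCT-Adapter model improves the most. The Adapter alone is not competitive in semantic segmentation (55.6% mIoU). Once the Adapter is combined with the Transformer, the extracted multi-scale prior knowledge is fully exploited, which strengthens the task adaptability and overall segmentation performance of Point-Bert. These results demonstrate the effectiveness of each Adapter module.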

Tab.4 Ablation experiment of PCT-Adapter on S3DIS dataset

Transformer	PFE	BFI (A-to-T)	BFI (T-to-A)	mIoU/%	mAcc/%
✕	✓	✕	✕	55.6	65.7
✓	✕	✕	✕	63.5	75.7
✓	✓	✕	✕	65.5	75.8
✓	✓	✓	✕	66.5	76.6
✓	✓	✓	✓	69.0	80.5

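3.5.2. Quantitative analysis of the number of bidirectional feature interactions

As shown in Fig.1, to fuse the Adapter and the Transformer deeply, each of them is divided into $ N $ blocks and $ N $ bidirectional feature interactions are performed. $ N $ is set to 0, 1, 2, 4, 6, and 8, and its effect on the proposed method is evaluated under the same training settings.

Tab.5 lists the semantic segmentation results on S3DIS for different values of $ N $. When the number of interactions between the Adapter and the Transformer is 0, PCT-Adapter is equivalent to the standard Transformer and reaches an mIoU of 63.5%. Performance improves as the number of feature interactions increases and peaks at 6 interactions; with 8 interactions, mIoU falls slightly to 68.9% and mAcc falls to 77.6%. The results show that multiple feature interactions play an important role in improving the performance of PCT-Adapter.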

Tab.5 Quantitative comparison of number of interactions on S3DIS dataset

N	mIoU/%	mAcc/%
0	63.5	75.7
1	65.9	76.1
2	66.4	76.0
4	67.0	76.8
6	69.0	80.5
8	68.9	77.6

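3.5.3. Effect of different pre-trained weights

To show that PCT-Adapter has cross-model generality on dense prediction tasks, the effect of several pre-trained weights on the adaptation of PCT-Adapter is compared. The pre-trained models include Point-Bert^[10], ReCon^[3], ACT^[7], Point-MAE^[9], and MaskPoint^[8].

The following conclusions can be drawn from Tab.6. 1) The proposed Adapter improves the performance of the standard Transformer. 2) All of the pre-trained weights improve the performance of PCT-Adapter on the dense prediction task. 3) The Point-Bert pre-trained parameters give PCT-Adapter its best performance.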

Tab.6 Quantitative comparison of loaded weights from various pretrained models

Method	Pre-trained weights	mIoU/%	mAcc/%
Standard Transformer	—	62.9	72.1
Standard Transformer	ACT^[7]	63.1	73.2
Standard Transformer	Point-MAE^[9]	63.1	72.1
Standard Transformer	MaskPoint^[8]	64.4	72.7
Standard Transformer	Point-Bert^[10]	63.5	75.7
Standard Transformer	ReCon^[3]	64.8	73.3
PCT-Adapter	—	66.2	73.7
PCT-Adapter	ACT^[7]	66.9	75.5
PCT-Adapter	Point-MAE^[9]	68.9	76.5
PCT-Adapter	MaskPoint^[8]	68.2	75.9
PCT-Adapter	Point-Bert^[10]	69.0	80.5
PCT-Adapter	ReCon^[3]	67.4	74.7

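4. Conclusion

This paper proposes PCT-Adapter, a framework that adapts the generic standard point cloud Transformer backbone to dense prediction tasks. Without changing the standard Transformer architecture, a feature extraction module that requires no pre-training and a bidirectional feature interaction module strengthen the feature representation of the standard Transformer. In addition, PCT-Adapter has cross-model generality and can load the weights of various pre-training methods. On dense prediction tasks represented by segmentation, the task adaptation performance of PCT-Adapter is verified: it effectively narrows the performance gap between the standard Transformer and variant Transformers, which supports the feasibility of the standard Transformer as a unified architecture for the point cloud field.

Nevertheless, PCT-Adapter has some limitations. Although it shows strong dense prediction performance and generality on more complex datasets, it does not achieve the desired effect on datasets with simpler part structures such as ShapeNetPart. Moreover, the cross-attention mechanism introduced by the bidirectional feature interaction module improves the performance of PCT-Adapter but increases the computational complexity and compute cost of the model. These issues point out directions for the further improvement of PCT-Adapter.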

References

[1] CHOE J, PARK C, RAMEAU F, et al. Pointmixer: Mlp-mixer for point cloud understanding [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 620-640.

[2] ZHAO H, JIANG L, FU C W, et al. Pointweb: enhancing local neighborhood features for point cloud processing [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5565-5573.

[3] QI Z, DONG R, FAN G, et al. Contrast with reconstruct: contrastive 3d representation learning guided by generative pretraining [C]// International Conference on Machine Learning. Honolulu: PMLR, 2023: 28223-28243.

[4] YANG Y Q, GUO Y X, XIONG J Y, et al. Swin3D: a pretrained Transformer backbone for 3D indoor scene understanding [EB/OL]. (2023-08-26) [2024-05-25]. https://arxiv.org/abs/2304.06906.

[5] ZHAO H, JIANG L, JIA J, et al. Point transformer [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 16259-16268.

[6] GUO M H, CAI J X, LIU Z N, et al. Pct: point cloud transformer [J]. Computational Visual Media, 2021, 7: 187-199. DOI:10.1007/s41095-021-0229-5.

[7] DONG R, QI Z, ZHANG L, et al. Autoencoders as cross-modal teachers: can pretrained 2D image Transformers help 3D representation learning? [EB/OL]. (2023-02-02) [2024-05-25]. https://arxiv.org/abs/2212.08320.

[8] LIU H, CAI M, LEE Y J. Masked discrimination for self-supervised learning on point clouds [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 657-675.

[9] PANG Y, WANG W, TAY F E H, et al. Masked autoencoders for point cloud self-supervised learning [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 604-621.

[10] YU X, TANG L, RAO Y, et al. Point-bert: pre-training 3d point cloud transformers with masked point modeling [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 19313-19322.

[11] CHANG A X, FUNKHOUSER T, GUIBAS L, et al. Shapenet: an information-rich 3d model repository [EB/OL]. (2015-12-09) [2024-05-25]. http://arxiv.org/abs/1512.03012.

[12] TANG Y, ZHANG R, GUO Z, et al. Point-PEFT: parameter-efficient fine-tuning for 3D pre-trained models [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 5171-5179.

[13] ZHANG C, WAN H, SHEN X, et al. Patchformer: an efficient point transformer with patch attention [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11799-11808.

[14] SUN J, QING C, TAN J, et al. Superpoint transformer for 3d scene instance segmentation [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Washington DC: AAAI, 2023: 2393-2401.

[15] DEVLIN J, CHANG M W, LEE K, et al. Bert: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: [s. n.], 2019: 4171-4186.

[16] STICKLAND A C, MURRAY I. Bert and pals: projected attention layers for efficient adaptation in multi-task learning [C]// International Conference on Machine Learning. Long Beach: ACM, 2019: 5986-5995.

[17] CHEN Z, DUAN Y, WANG W, et al. Vision transformer adapter for dense predictions [EB/OL]. (2023-02-13) [2024-05-25]. https://arxiv.org/abs/2205.08534.

[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.

[19] WANG W, XIE E, LI X, et al. Pvt v2: improved baselines with pyramid vision transformer [J]. Computational Visual Media, 2022, 8 (3): 415-424. DOI:10.1007/s41095-022-0274-8.

[20] WU X, TIAN Z, WEN X, et al. Towards large-scale 3d representation learning with multi-dataset point prompt training [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 19551-19562.

[21] ZHANG R, WANG L, WANG Y, et al. Starting from non-parametric networks for 3d point cloud analysis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 5344-5353.

[22] MA X, QIN C, YOU H, et al. Rethinking network design and local geometry in point cloud: a simple residual MLP framework [EB/OL]. (2022-11-29) [2024-05-25]. https://arxiv.org/abs/2202.07123.

[23] BEHLEY J, GARBADE M, MILIOTO A, et al. Semantickitti: a dataset for semantic scene understanding of lidar sequences [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 9297-9307.

[24] ARMENI I, SENER O, ZAMIR A R, et al. 3d semantic parsing of large-scale indoor spaces [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1534-1543.

[25] HAN X F, HE Z Y, CHEN J, et al. 3CROSSNet: cross-level cross-scale cross-attention network for point cloud representation [J]. IEEE Robotics and Automation Letters, 2022, 7 (2): 3718-3725. DOI:10.1109/LRA.2022.3147907.

[26] QI C R, YI L, SU H, et al. Pointnet++: deep hierarchical feature learning on point sets in a metric space [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 5105-5114.

[27] YAN X, ZHENG C, LI Z, et al. Pointasnl: robust point clouds processing using nonlocal neural networks with adaptive sampling [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5589-5598.

[28] YANG X, JIN M, HE W, et al. PointCAT: cross-attention Transformer for point cloud [EB/OL]. (2023-04-06) [2024-05-25]. https://arxiv.org/abs/2304.03012.

[29] YANG J, ZHANG Q, NI B, et al. Modeling point clouds with self-attention and gumbel subset sampling [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 3323-3332.

[30] SU H, JAMPANI V, SUN D, et al. Splatnet: sparse lattice networks for point cloud processing [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake: IEEE, 2018: 2530-2539.

[31] TATARCHENKO M, PARK J, KOLTUN V, et al. Tangent convolutions for dense prediction in 3d [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake: IEEE, 2018: 3887-3896.

[32] WU B, ZHOU X, ZHAO S, et al. Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud [C]// IEEE International Conference on Robotics and Automation. Montreal: IEEE, 2019: 4376-4382.