Chinese Journal of Engineering Design  2024, Vol. 31 Issue (2): 238-247    DOI: 10.3785/j.issn.1006-754X.2024.03.150
Robotic and Mechanism Design     
Pixel-level grasping pose detection for robots based on Transformer
Qingsong YU, Xiangrong XU, Yinzhen LIU
School of Mechanical Engineering, Anhui University of Technology, Maanshan 243032, China

Abstract  

Robot grasp detection has long been a research focus in robotics, but robots suffer from inaccurate pose estimation when performing multi-object grasping tasks in complex environments. To address this problem, a Transformer-based grasp detection model, PTGNet (pyramid Transformer grasp network), was proposed. PTGNet adopts Transformer modules with a pyramid pooling structure and a multi-head self-attention mechanism. The pyramid pooling structure segments and pools feature maps to capture semantic information at different levels and to reduce computational complexity, while the multi-head self-attention mechanism effectively extracts global information through its strong feature extraction capability, making PTGNet well suited to visual grasping tasks. To verify the performance of PTGNet, it was trained and tested on different datasets, and robot arm grasping experiments based on PTGNet were carried out in both simulated and real physical environments. The results showed that the accuracy of PTGNet on the Cornell and Jacquard datasets was 98.2% and 94.8%, respectively, a highly competitive performance. Compared with other detection models, PTGNet exhibited excellent generalization ability on multi-target datasets. In the single-object and multi-object grasping experiments conducted in the PyBullet simulation environment, the average grasping success rate of the robot arm reached 98.1% and 96.8%, respectively; in the multi-object grasping experiments conducted in the real physical environment, the average grasping success rate was 93.3%. The experimental results demonstrate the effectiveness and superiority of PTGNet in predicting multi-object grasping poses in complex environments.
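As a rough illustration of the mechanism described above, the sketch below builds a single attention layer in which the keys and values are formed by pooling the feature map at several pyramid scales before multi-head self-attention is applied. It is a minimal PyTorch sketch under our own assumptions: the layer sizes, pooling ratios and module names are illustrative, not the authors' implementation.

import torch
import torch.nn as nn


class PyramidPoolingAttention(nn.Module):
    """Multi-head self-attention whose keys/values come from pyramid-pooled
    feature maps. Minimal sketch: pool sizes, dims and structure are
    illustrative assumptions, not the PTGNet implementation."""

    def __init__(self, dim=96, num_heads=4, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in pool_sizes])
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)           # queries: all H*W pixel tokens
        # Keys/values: concatenation of pooled maps -> far fewer tokens
        kv = torch.cat([p(x).flatten(2) for p in self.pools], dim=2)
        kv = kv.transpose(1, 2)                    # (B, sum(s*s), C)
        out, _ = self.attn(self.norm(q), kv, kv)   # global context at reduced cost
        return (q + out).transpose(1, 2).reshape(b, c, h, w)


# Example: a 96-channel, 40x40 feature map attends over 1+4+9+36 = 50 pooled
# tokens instead of 1600 pixel tokens, which is where the complexity drop comes from.
layer = PyramidPoolingAttention()
y = layer(torch.randn(2, 96, 40, 40))
print(y.shape)  # torch.Size([2, 96, 40, 40])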



Key words: Transformer; pyramid pooling; grasp detection; multi-head self-attention
Received: 13 April 2023      Published: 26 April 2024
CLC:  TP 242  
Corresponding Authors: Xiangrong XU     E-mail: 1451759340@qq.com;xuxr@ahut.edu.cn
Cite this article:

Qingsong YU,Xiangrong XU,Yinzhen LIU. Pixel-level grasping pose detection for robots based on Transformer. Chinese Journal of Engineering Design, 2024, 31(2): 238-247.

URL:

https://www.zjujournals.com/gcsjxb/10.3785/j.issn.1006-754X.2024.03.150     OR     https://www.zjujournals.com/gcsjxb/Y2024/V31/I2/238


Fig.1 PTGNet structure
Fig.2 Transformer layer structure with pyramid pooling
Reference    Model 1)                         Accuracy/%                            Detection time/ms
                                              Image-wise split   Object-wise split
Ref. [22]    Fast Search (RGB-D)              60.5               58.3               5 000
Ref. [13]    GG-CNN (D)                       73.0               69.0               19
Ref. [11]    SAE (RGB-D)                      73.9               75.6               1 350
Ref. [24]    Two-stage closed-loop (RGB-D)    85.3               –                  140
Ref. [7]     AlexNet, MultiGrasp (RGB-D)      88.0               87.1               76
Ref. [25]    STEM-CaRFs (RGB-D)               88.2               87.5               –
Ref. [26]    GRPN (RGB)                       88.7               –                  200
Ref. [6]     ResNet-50x2 (RGB-D)              89.2               88.9               103
Ref. [9]     GraspNet (RGB-D)                 90.2               90.6               24
Ref. [27]    ZF-Net (RGB-D)                   93.2               89.1               –
Ref. [28]    GR-ConvNet (RGB-D)               97.7               96.6               20
This paper   PTGNet (D)                       95.4               95.0               40.4
             PTGNet (RGB)                     96.8               95.2               40.7
             PTGNet (RGB-D)                   98.2               96.9               41.1
Table 1 Comparison of accuracy of different grasping detection models on Cornell dataset
Reference    Model 1)                   Accuracy/%   Detection time/ms
Ref. [23]    Jacquard (RGB-D)           74.2         –
Ref. [13]    GG-CNN2 (D)                84.0         20
Ref. [29]    FCGN, ResNet-101 (RGB)     91.8         117
Ref. [30]    Det Seg Refine (RGB)       92.95        32.3
Ref. [8]     ROI-GD (RGB)               93.6         –
Ref. [28]    GR-ConvNet (RGB-D)         94.6         20
This paper   PTGNet (D)                 93.3         41.5
             PTGNet (RGB)               93.7         41.9
             PTGNet (RGB-D)             94.8         42.5
Table 2 Comparison of accuracy of different grasping detection models on Jacquard dataset
Fig.3 Partial detection results of PTGNet on Cornell dataset
Fig.4 Partial detection results of PTGNet on Jacquard dataset
Fig.5 Partial detection results of different grasping detection models on multi-object dataset
Fig.6 Partial detection results of different grasping detection models on clutter dataset
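The grasp rectangles visualized in Figs. 3 to 6 are drawn from pixel-wise network outputs. As a rough illustration only, the following sketch decodes a single best grasp from per-pixel quality, angle and width maps, assuming the GG-CNN/GR-ConvNet-style pixel-wise output representation; the map names and the fixed rectangle height are illustrative assumptions, not the paper's exact post-processing.

import numpy as np


def decode_best_grasp(quality, angle, width, gripper_height=20.0):
    """Pick the highest-quality pixel and read the grasp pose at that pixel.

    quality, angle, width: (H, W) maps predicted by the network.
    Returns (row, col, theta, grasp_width, grasp_height) describing a grasp
    rectangle centred on that pixel. Assumed representation, in the style of
    GG-CNN / GR-ConvNet pixel-level grasp detection.
    """
    row, col = np.unravel_index(np.argmax(quality), quality.shape)
    theta = float(angle[row, col])        # gripper rotation in radians
    grasp_width = float(width[row, col])  # gripper opening width in pixels
    return row, col, theta, grasp_width, gripper_height


# Toy example with random maps standing in for network output
h, w = 224, 224
q = np.random.rand(h, w)
ang = np.random.uniform(-np.pi / 2, np.pi / 2, (h, w))
wid = np.random.uniform(0, 150, (h, w))
print(decode_best_grasp(q, ang, wid))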
Dataset                Model 1)             Accuracy/%   Detection time/ms
Multi-object dataset   GG-CNN (RGB-D)       83.6         22
                       GR-ConvNet (RGB-D)   94.7         24
                       PTGNet (RGB-D)       95.1         47
Clutter dataset        GG-CNN (RGB-D)       82.9         34
                       GR-ConvNet (RGB-D)   93.8         35
                       PTGNet (RGB-D)       94.3         68
Table 3 Comparison of accuracy of different grasping detection models on multi-target dataset
Fig.7 Robot arm grasping experiment in simulation environment
Task scenario            Model     Grasping success rate/%
Single-object scenario   GG-CNN    83.4
                         PTGNet    98.1
Multi-object scenario    GG-CNN    79.6
                         PTGNet    96.8
Table 4 Comparison of grasping success rate of robot arm in simulation environment
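The simulated trials summarized above were run in PyBullet. The sketch below only illustrates how such a scene can be set up and how an overhead RGB-D image can be rendered for the detector; the object URDFs, camera pose and image resolution are illustrative assumptions rather than the paper's experimental configuration.

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                       # headless; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")
p.loadURDF("tray/tray.urdf", basePosition=[0.5, 0.0, 0.0])
# Drop an example object into the tray (sample URDF shipped with pybullet_data)
p.loadURDF("random_urdfs/000/000.urdf", basePosition=[0.5, 0.0, 0.2])
for _ in range(240):                      # let the object settle (~1 s at 240 Hz)
    p.stepSimulation()

# Render a top-down RGB-D view to feed the grasp detector
view = p.computeViewMatrix(cameraEyePosition=[0.5, 0.0, 0.7],
                           cameraTargetPosition=[0.5, 0.0, 0.0],
                           cameraUpVector=[0.0, 1.0, 0.0])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)
width, height, rgb, depth, _ = p.getCameraImage(224, 224, view, proj)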
Fig.8 Robot arm grasping experiment in real physical environment
Reference    Grasping success rate/%   Detection time/ms
Ref. [11]    89.0 (89/100)             1 350
Ref. [31]    89.0 (89/100)             120
Ref. [8]     90.6 (29/32)              40
Ref. [13]    92.0 (110/120)            19
Ref. [28]    93.0 (93/100)             20
This paper   93.3 (168/180)            41.1
Table 5 Comparison of grasping success rate of robot arm in real physical environment
[1]   BICCHI A, KUMAR V. Robotic grasping and contact: a review[C]//IEEE International Conference on Robotics and Automation. San Francisco, CA, Apr. 24-28, 2000.
[2]   BUCHHOLZ D, FUTTERLIEB M, WINKELBACH S, et al. Efficient bin-picking and grasp planning based on depth data[C]//2013 IEEE International Conference on Robotics and Automation. Karlsruhe, May 6-10, 2013.
[3]   LU J N, LIU Y, WANG L J, et al. Research on shovel tooth wear detection based on improved Mask Scoring R-CNN[J]. Chinese Journal of Engineering Design, 2022, 29(3): 309-317. doi: 10.3785/j.issn.1006-754X.2022.00.046
[4]   LI M, LU P, ZHU L, et al. Densely occluded grasping objects detection based on RGB-D fusion[J]. Control and Decision, 2023, 38(10): 2867-2874.
[5]   CHU H Y, LENG Q Q, ZHANG X Q, et al. Multi-modal feature robotic arm grasping pose detection with attention mechanism[J]. Control and Decision, 2024, 39(3): 777-785.
[6]   KUMRA S, KANAN C. Robotic grasp detection using deep convolutional neural networks[C]//2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, Sep. 24-28, 2017.
[7]   REDMON J, ANGELOVA A. Real-time grasp detection using convolutional neural networks[C]//2015 IEEE International Conference on Robotics and Automation (ICRA). Seattle, WA, May 26-30, 2015.
[8]   ZHANG H, LAN X, BAI S, et al. ROI-based robotic grasp detection for object overlapping scenes[C]//2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Macau, Nov. 3-8, 2019.
[9]   ASIF U, TANG J B, HARRER S. GraspNet: an efficient convolutional neural network for real-time grasp detection for low-powered devices[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Jul. 13-19, 2018.
[10]   ZHU X, SUN L, FAN Y, et al. 6-DOF contrastive grasp proposal network[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). Xi'an, May 30-Jun. 5, 2021.
[11]   LENZ I, LEE H, SAXENA A. Deep learning for detecting robotic grasps[J]. The International Journal of Robotics Research, 2015, 34(4/5): 705-724.
[12]   PARK D, SEO Y, SHIN D, et al. A single multi-task deep neural network with post-processing for object detection with reasoning and robotic grasp detection[C]//2020 IEEE International Conference on Robotics and Automation(ICRA). Paris, May 31-Aug. 31, 2020.
[13]   MORRISON D, CORKE P, LEITNER J. Learning robust, real-time, reactive robotic grasping[J]. The International Journal of Robotics Research, 2020, 39(2/3): 183-201.
[14]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 1-15.
[15]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16 × 16 words: Transformers for image recognition at scale[C]//International Conference on Learning Representations. Online, May 3-7, 2021.
[16]   ZHANG Z, ZHANG H, ZHAO L, et al. Nested hierarchical Transformer: towards accurate, data-efficient and interpretable visual understanding[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2022: 3417-3425.
[17]   LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, QC, Oct. 10-17, 2021.
[18]   WANG W, XIE E, LI X, et al. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, QC, Oct. 10-17, 2021.
[19]   WU Y H, LIU Y, ZHAN X, et al. P2T: pyramid pooling Transformer for scene understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 2022(8): 1-12.
[20]   SI C, YU W, ZHOU P, et al. Inception Transformer[J]. Advances in Neural Information Processing Systems, 2022, 35: 23495-23509.
[21]   YUAN L, HOU Q, JIANG Z, et al. Volo: vision outlooker for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(5): 6575-6586.
[22]   JIANG Y, MOSESON S, SAXENA A. Efficient grasping from RGBD images: learning using a new rectangle representation[C]//2011 IEEE International Conference on Robotics and Automation. Shanghai, May 9-13, 2011.
[23]   DEPIERRE A, DELLANDRÉA E, CHEN L. Jacquard: a large scale dataset for robotic grasp detection[C]//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Oct. 1-5, 2018.
[24]   WANG Z, LI Z, WANG B, et al. Robot grasp detection using multimodal deep convolutional neural networks[J]. Advances in Mechanical Engineering, 2016, 8(9): 1-12.
[25]   ASIF U, BENNAMOUN M, SOHEL F A. RGB-D object recognition and grasp detection using hierarchical cascaded forests[J]. IEEE Transactions on Robotics, 2017, 33(3): 547-564.
[26]   KARAOGUZ H, JENSFELT P. Object detection approach for robot grasp detection[C]//2019 International Conference on Robotics and Automation (ICRA). Montreal, QC, May 20-24, 2019.
[27]   GUO D, SUN F, LIU H, et al. A hybrid deep architecture for robotic grasp detection[C]//2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, May 29-Jun. 2, 2017.
[28]   KUMRA S, JOSHI S, SAHIN F. Antipodal robotic grasping using generative residual convolutional neural network[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Las Vegas, NV, Oct. 25-29, 2020.
[29]   ZHOU X, LAN X, ZHANG H, et al. Fully convolutional grasp detection network with oriented anchor box[C]//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Oct. 1-5, 2018.
[30]   AINETTER S, FRAUNDORFER F. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). Xi'an, May 30-Jun. 5, 2021.
[31]   CHU F J, XU R, VELA P A. Real-world multiobject, multigrasp detection[J]. IEEE Robotics and Automation Letters, 2018, 3(4): 3355-3362.
[32]   WANG D, LIU C, CHANG F, et al. High-performance pixel-level grasp detection based on adaptive grasping and grasp-aware network[J]. IEEE Transactions on Industrial Electronics, 2021, 69(11): 11611-11621.