Journal of Zhejiang University (Engineering Science)  2023, Vol. 57 Issue (2): 299-309    DOI: 10.3785/j.issn.1008-973X.2023.02.010
Computer Technology
Deep clustering via high-order mutual information maximization and pseudo-label guidance
Chao LIU1, Bing KONG1,*, Guo-wang DU2, Li-hua ZHOU1, Hong-mei CHEN1, Chong-ming BAO3
1. School of Information Science and Engineering, Yunnan University, Kunming 650504, China
2. South-Western Institute For Astronomy Research, Yunnan University, Kunming 650504, China
3. School of Software, Yunnan University, Kunming 650504, China
Abstract:

A high-order mutual information maximization and pseudo-label guided deep clustering model, HMIPDC, was proposed to address the problems that existing clustering methods do not fully explore the topological structure and node relationships of the graph, and cannot benefit from the inaccurate labels predicted by the model. A high-order mutual information maximization strategy was adopted to maximize the mutual information among the global representation of the graph, the node representations, and the node attribute information. Low-dimensional representations of nodes were extracted more reasonably through a self-attention mechanism combined with multi-hop proximity matrices. A deep divergence-based clustering (DDC) loss function was used to iteratively optimize the clustering objective, while high-confidence predicted labels were extracted to supervise the learning of the low-dimensional representations. Clustering experiments, running-time analysis and clustering visualization on four benchmark datasets show that HMIPDC consistently outperforms most deep clustering methods. The effectiveness and stability of the model were further verified by an ablation study and parameter sensitivity analysis.
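The mutual information maximization described above is in the spirit of Deep Graph Infomax (reference [20]). The sketch below is an illustrative NumPy reconstruction of such an objective, not the authors' exact formulation: a bilinear discriminator scores (node embedding, graph summary) pairs, pushing scores for real pairs high and scores for corrupted pairs low.

```python
import numpy as np

def bilinear_scores(H, s, W):
    """Sigmoid of the bilinear score h_i^T W s for every node embedding."""
    return 1.0 / (1.0 + np.exp(-(H @ W @ s)))

def infomax_loss(H, H_corrupt, W):
    """DGI-style binary cross-entropy: real (node, summary) pairs should
    score high, pairs built from corrupted embeddings should score low."""
    s = np.tanh(H.mean(axis=0))              # readout: mean-pooled graph summary
    pos = bilinear_scores(H, s, W)           # scores for real nodes
    neg = bilinear_scores(H_corrupt, s, W)   # scores for corrupted nodes
    eps = 1e-9                               # numerical safety for log
    return -(np.log(pos + eps).mean() + np.log(1.0 - neg + eps).mean()) / 2
```

Corrupted embeddings are typically obtained by row-shuffling the node features before encoding; with `W = 0` the discriminator is uninformative and the loss reduces to ln 2.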

Key words: self-supervised learning    deep clustering    self-attention mechanism    high-order mutual information    pseudo-label
Received: 2022-07-30    Published: 2023-02-28
CLC:  TP 391  
Supported by: National Natural Science Foundation of China (62062066, 61762090, 31760152, 61966036, 62266050, 62276227); Key Project of the Yunnan Fundamental Research Program 2022 (202201AS070015); Yunnan Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (202205AC160033)
Corresponding author: Bing KONG    E-mail: chaoliu@mail.ynu.edu.cn; kongbing@ynu.edu.cn
About the first author: Chao LIU (1996—), male, master's student, research interests: data mining and deep clustering. orcid.org/0000-0001-5083-6744. E-mail: chaoliu@mail.ynu.edu.cn
Cite this article:


Chao LIU, Bing KONG, Guo-wang DU, Li-hua ZHOU, Hong-mei CHEN, Chong-ming BAO. Deep clustering via high-order mutual information maximization and pseudo-label guidance. Journal of Zhejiang University (Engineering Science), 2023, 57(2): 299-309.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.02.010        https://www.zjujournals.com/eng/CN/Y2023/V57/I2/299

Fig. 1  Schematic diagram of the HMIPDC model
Dataset $n$ $K$ $d$ $\xi $
ACM 3 025 3 1 870 13 128
Citeseer 3 327 6 3 703 4 552
DBLP 4 057 4 334 3 528
AMAP 7 650 8 745 119 081
Table 1  Statistics of the four benchmark datasets
(Unit: %)
Dataset Method ACC NMI ARI F1
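The multi-hop proximity matrices combined with self-attention in the abstract can be built from the adjacency matrix of graphs like those in Table 1. A minimal sketch under the common self-loop plus row-normalization convention follows; the paper's exact construction may differ.

```python
import numpy as np

def multi_hop_proximity(A, hops=3):
    """Return the row-stochastic transition matrix powers T, T^2, ..., T^hops,
    which capture 1-hop up to `hops`-hop neighborhood proximity."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    T = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize to a transition matrix
    mats, P = [], np.eye(A.shape[0])
    for _ in range(hops):
        P = P @ T                                  # one more hop of propagation
        mats.append(P.copy())
    return mats
```

Each matrix in the list can then serve as an attention bias or propagation operator over a different neighborhood radius.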
ACM K-means 67.31±0.71 32.44±0.46 30.60±0.69 67.57±0.74
AE 81.83±0.08 49.30±0.16 59.64±0.16 82.01±0.08
IDEC 85.12±0.52 56.61±1.16 62.16±1.50 85.11±0.48
GAE 84.52±1.44 55.38±1.92 59.46±3.10 84.65±1.33
DAEGC 88.24±0.02 63.01±0.04 68.70±0.04 88.11±0.01
SDCN 90.45±0.18 68.31±0.25 73.91±0.40 90.42±0.19
DFCN 90.90±0.20 64.40±0.40 74.90±0.40 90.80±0.20
DCRN 91.93±0.20 71.56±0.61 77.56±0.52 91.94±0.20
HMIPDC 92.12±0.15 72.18±0.52 78.04±0.39 92.13±0.14
Citeseer K-means 39.32±3.17 16.94±3.22 13.43±3.02 36.08±3.53
AE 57.08±0.13 27.64±0.08 29.31±0.14 53.80±0.11
IDEC 60.49±1.42 27.17±2.40 25.70±2.65 61.62±1.39
GAE 61.35±0.80 34.63±0.65 33.55±1.18 57.36±0.82
DAEGC 64.90±0.07 38.71±0.08 39.21±0.09 59.56±0.06
SDCN 65.96±0.31 38.71±0.32 40.17±0.43 63.62±0.24
DFCN 69.50±0.20 43.90±0.20 45.50±0.30 64.30±0.20
DCRN 70.86±0.18 45.86±0.35 47.64±0.30 65.83±0.21
HMIPDC 71.93±0.31 46.07±0.23 48.28±0.50 66.96±0.26
DBLP K-means 38.65±0.65 11.45±0.38 6.97±0.39 31.92±0.27
AE 51.43±0.35 25.40±0.16 12.21±0.43 52.53±0.36
IDEC 60.31±0.62 31.17±0.50 25.37±0.60 61.33±0.56
GAE 61.21±1.22 30.80±0.91 22.02±1.40 61.41±2.23
DAEGC 67.42±0.38 30.64±0.46 32.79±0.58 66.89±0.37
SDCN 68.05±1.81 39.50±1.34 39.15±2.01 67.71±1.51
DFCN 76.00±0.80 43.70±1.00 47.00±1.50 75.70±0.80
DCRN 79.66±0.25 48.95±0.44 53.60±0.46 79.28±0.26
HMIPDC 80.34±0.16 49.41±0.34 55.39±0.28 79.76±0.32
AMAP K-means 27.22±0.76 13.23±1.33 5.50±0.44 23.96±0.51
AE 48.25±0.08 38.76±0.30 20.80±0.47 47.87±0.20
IDEC 47.62±0.08 37.83±0.08 19.24±0.07 47.20±0.11
GAE 71.57±2.48 62.13±2.79 48.82±4.57 68.08±1.76
DAEGC 75.52±0.01 63.31±0.01 59.98±0.84 70.02±0.01
SDCN 53.44±0.81 44.85±0.83 31.21±1.23 50.66±1.49
DFCN 76.88±0.80 69.21±1.00 59.98±0.84 71.58±0.31
DCRN 79.94±0.13 73.70±0.24 63.69±0.20 73.82±0.12
HMIPDC 80.83±0.78 69.47±0.46 65.23±1.64 75.87±2.08
Table 2  Clustering results of HMIPDC and eight baseline methods on four datasets
(Unit: %)
Dataset Method ACC NMI ARI F1
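ACC in Tables 2 and 3 is conventionally computed by matching predicted cluster IDs to ground-truth classes with the Hungarian algorithm (NMI, ARI and F1 have standard implementations in scikit-learn). A sketch of that matching step, assuming labels are integers starting at 0:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match clustering accuracy: find the cluster-to-class mapping
    that maximizes agreement, via the Hungarian algorithm."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # confusion counts: cluster p vs class t
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize total matches
    return cost[row, col].sum() / len(y_true)
```

Cluster IDs are arbitrary permutations of the class labels, which is why the raw accuracy must be maximized over all one-to-one mappings.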
ACM AD 91.87±0.12 70.92±0.29 77.49±0.32 91.67±0.29
AD-MI 91.94±0.15 71.50±0.33 77.57±0.36 91.95±0.15
AD-PL 92.07±0.12 72.08±0.23 77.92±0.28 92.08±0.12
AD-MI-PL 92.12±0.15 72.18±0.52 78.04±0.39 92.13±0.14
Citeseer AD 64.80±0.98 39.25±0.92 40.28±0.69 62.86±0.64
AD-MI 70.65±0.89 44.82±0.50 47.47±0.68 66.56±0.38
AD-PL 70.67±0.58 43.90±0.73 46.08±0.91 64.57±0.78
AD-MI-PL 71.93±0.31 46.07±0.23 48.28±0.50 66.96±0.26
DBLP AD 78.30±0.62 46.94±0.45 52.07±0.75 77.72±0.59
AD-MI 79.05±0.35 47.83±0.57 52.96±0.61 78.60±0.34
AD-PL 79.21±0.80 48.66±0.71 54.24±0.53 78.83±0.57
AD-MI-PL 80.34±0.16 49.41±0.34 55.39±0.28 79.76±0.32
AMAP AD 72.67±1.19 61.62±1.15 54.17±1.05 66.25±2.35
AD-MI 76.68±0.98 65.53±0.85 58.84±0.89 73.17±2.47
AD-PL 74.36±0.91 64.11±0.95 57.23±1.36 71.64±2.16
AD-MI-PL 80.83±0.78 69.47±0.46 65.23±1.64 75.87±2.08
Table 3  Clustering results of HMIPDC and three variant algorithms on four datasets
Fig. 2  Clustering accuracy on four datasets under different hyperparameters
Fig. 3  Clustering accuracy of five methods on the ACM dataset under different training times
Fig. 4  2D visualization of low-dimensional representations produced by different methods on the ACM dataset
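The pseudo-label (PL) component ablated above relies on extracting high-confidence predicted labels to supervise representation learning. A minimal sketch of such a selection step follows; the 0.95 threshold is an assumed illustrative value, not taken from the paper.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    """Keep only samples whose highest soft cluster-assignment probability
    exceeds the confidence threshold; return their indices and hard labels."""
    confidence = probs.max(axis=1)           # per-sample top assignment probability
    mask = confidence > threshold
    return np.flatnonzero(mask), probs.argmax(axis=1)[mask]
```

The returned (index, label) pairs can then act as targets in a supervised loss term on the low-dimensional representations, while low-confidence samples are left unsupervised.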
1 ZHAN Z H, LI J Y, ZHANG J Evolutionary deep learning: a survey[J]. Neurocomputing, 2022, 483: 42- 58
doi: 10.1016/j.neucom.2022.01.099
2 LIN Y J, GOU Y B, LIU Z T, et al. COMPLETER: incomplete multi-view clustering via contrastive prediction[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s. l. ]: CVPR, 2021: 11174-11183.
3 XIE J Y, GIRSHICK R, FARHADI A. Unsupervised deep embedding for clustering analysis[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: ICML, 2016: 478-487.
4 GUO X F, GAO L, LIU X W, et al. Improved deep embedded clustering with local structure preservation[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne: IJCAI, 2017: 1753-1759.
5 ZHANG S S, LIU J W, ZUO X, et al Online deep learning based on auto-encoder[J]. Applied Intelligence, 2021, 51 (8): 5420- 5439
doi: 10.1007/s10489-020-02058-8
6 WU Z H, PAN S R, CHEN F W, et al A comprehensive survey on graph neural networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32 (1): 4- 24
doi: 10.1109/TNNLS.2020.2978386
7 KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[C]// 5th International Conference on Learning Representations. Toulon: ICLR, 2017: 1-14.
8 WANG C, PAN S R, HU R Q, et al. Attributed graph clustering: a deep attentional embedding approach[C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao: IJCAI, 2019: 3670-3676.
9 BO D Y, WANG X, SHI C, et al. Structural deep clustering network[C]// Proceedings of the Web Conference 2020. Taipei: WWW, 2020: 1400-1410.
10 DU Guo-wang, ZHOU Li-hua, WANG Li-zhen, et al Multi-view clustering based on two-level weights[J]. Journal of Computer Research and Development, 2022, 59 (4): 907- 921 (in Chinese)
doi: 10.7544/issn1000-1239.20200897
11 KAMPFFMEYER M, LØKSE S, BIANCHI F M, et al Deep divergence-based approach to clustering[J]. Neural Networks, 2019, 113: 91- 101
doi: 10.1016/j.neunet.2019.01.015
12 MOLAEI S, BOUSEJIN N G, ZARE H, et al Deep node clustering based on mutual information maximization[J]. Neurocomputing, 2021, 455: 274- 282
doi: 10.1016/j.neucom.2021.03.020
13 CHEN Yi-qi, QIAN Tie-yun, LI Wan-li, et al Exploiting composite relation graph convolution for attributed network embedding[J]. Journal of Computer Research and Development, 2020, 57 (8): 1674- 1682 (in Chinese)
doi: 10.7544/issn1000-1239.2020.20200206
14 KIPF T N, WELLING M. Variational graph auto-encoders[C]// Bayesian Deep Learning Workshop on 30th Conference on Neural Information Processing Systems. Barcelona: NIPS, 2016.
15 KOU S W, XIA W, ZHANG X D, et al Self-supervised graph convolutional clustering by preserving latent distribution[J]. Neurocomputing, 2021, 437: 218- 226
doi: 10.1016/j.neucom.2021.01.082
16 TU W X, ZHOU S H, LIU X W, et al. Deep fusion clustering network[C]// 35th AAAI Conference on Artificial Intelligence. [s.l.]: AAAI, 2021: 9978-9987.
17 LIU Y, TU W X, ZHOU S H, et al. Deep graph clustering via dual correlation reduction[C]// 36th Conference on Artificial Intelligence. Vancouver: AAAI, 2022: 7603-7611.
18 BELGHAZI M I, BARATIN A, RAJESWAR S, et al. MINE: mutual information neural estimation[C]// Proceedings of the 35th International Conference on Machine Learning. Stockholm: ICML, 2018: 531-540.
19 HJELM R D, FEDOROV A, LAVOIE-MARCHILDON S, et al. Learning deep representations by mutual information estimation and maximization[C]// 7th International Conference on Learning Representations. New Orleans: ICLR, 2019.
20 VELIČKOVIĆ P, FEDUS W, HAMILTON W L, et al. Deep graph infomax[C]// 7th International Conference on Learning Representations. New Orleans: ICLR, 2019.
21 JING B Y, PARK C Y, TONG H H. HDMI: high-order deep multiplex infomax[C]// Proceedings of the Web Conference 2021. New York: WWW, 2021: 2414-2424.
22 MCGILL W J Multivariate information transmission[J]. Transactions of the IRE Professional Group on Information Theory, 1954, 4 (4): 93- 111
doi: 10.1109/TIT.1954.1057469
23 VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph attention networks[C]// 5th International Conference on Learning Representations. Toulon: ICLR, 2017.
24 RIZVE M N, DUARTE K, RAWAT Y S, et al. In defense of pseudo-labeling: an uncertainty-aware pseudo-label selection framework for semi-supervised learning[C]// 9th International Conference on Learning Representations. [s. l. ]: ICLR, 2021.
25 HARTIGAN J A, WONG M A Algorithm AS 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28 (1): 100- 108
26 ZHAO H, YANG X, WANG Z R, et al. Graph debiased contrastive learning with joint representation clustering[C]// Proceedings of the 30th International Joint Conference on Artificial Intelligence. [s.l.]: IJCAI, 2021: 3434-3440.
27 LV J C, KANG Z, LU X, et al Pseudo-supervised deep subspace clustering[J]. IEEE Transactions on Image Processing, 2021, 30: 5252- 5263
doi: 10.1109/TIP.2021.3079800
28 BOUYER A, ROGHANI H LSMD: a fast and robust local community detection starting from low degree nodes in social networks[J]. Future Generation Computer Systems, 2020, 113: 41- 57
doi: 10.1016/j.future.2020.07.011
29 KINGMA D P, BA J. Adam: a method for stochastic optimization[C]// 3rd International Conference for Learning Representations. San Diego: ICLR, 2015.
30 VAN DER MAATEN L, HINTON G Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9: 2579- 2605
31 ZHOU Li-hua, WANG Jia-long, WANG Li-zhen, et al Heterogeneous information network representation learning: a survey[J]. Chinese Journal of Computers, 2022, 45 (1): 160- 189 (in Chinese)
doi: 10.11897/SP.J.1016.2022.00160