Journal of Zhejiang University (Engineering Science), 2019, Vol. 53, Issue (3): 548-554    DOI: 10.3785/j.issn.1008-973X.2019.03.016
Computer Technology
Heterogeneous information network representation learning based on transition probability matrix (HINtpm)
Ting-ting ZHAO, Zhe WANG*, Yi-nan LU
College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun 130012, China
Abstract:

First, the final transition probability matrix was obtained according to the meta-path and the commuting matrix, combining the first-order and second-order similarity of the nodes. Then, a denoising auto-encoder was used to reduce the dimension of the transition probability matrix and obtain the node representations of the heterogeneous information network. Finally, the node representations were classified with a gradient boosting decision tree (GBDT) to obtain the classification accuracy under different training-set percentages; the clustering quality was evaluated with the normalized mutual information (NMI) index, and the embeddings were visualized with t-SNE. Experiments were performed on the DBLP and AMiner data sets, comparing the proposed heterogeneous information network representation learning based on transition probability matrix (HINtpm) with the DeepWalk, node2vec and metapath2vec methods. Compared with DeepWalk, HINtpm improved the classification accuracy on the node classification task by up to 24% and increased the clustering index NMI by up to 13%.
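As a rough, hypothetical illustration of the first step (building the transition probability matrix from a meta-path commuting matrix and mixing first- and second-order similarity), the NumPy sketch below uses toy adjacency matrices and an assumed mixing weight alpha; none of these details are taken from the paper.

```python
import numpy as np

def commuting_matrix(adjacencies):
    """Multiply the adjacency matrices along a meta-path.
    For A-P-C-P-A (Fig. 3) the list would be [W_AP, W_PC, W_PC.T, W_AP.T]."""
    M = adjacencies[0]
    for W in adjacencies[1:]:
        M = M @ W
    return M

def transition_probability(M, alpha=0.5):
    """Row-normalize the commuting matrix into first-order transition
    probabilities, then mix in a two-hop (second-order) term.
    The mixing weight alpha is an assumption, not the paper's choice."""
    row_sums = M.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0           # guard against isolated nodes
    P1 = M / row_sums                       # first-order similarity
    P2 = P1 @ P1                            # second-order similarity
    return alpha * P1 + (1.0 - alpha) * P2  # final propagation matrix

# toy DBLP-style example: 3 authors, 4 papers, 2 venues
rng = np.random.default_rng(0)
W_AP = (rng.random((3, 4)) < 0.5).astype(float)   # author-paper links
W_PC = (rng.random((4, 2)) < 0.5).astype(float)   # paper-venue links
M = commuting_matrix([W_AP, W_PC, W_PC.T, W_AP.T])
P = transition_probability(M)
print(P.shape)   # (3, 3): author-to-author propagation probabilities
```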

Key words: network representation learning    heterogeneous information network (HIN)    transition probability matrix    meta-path    nodes' similarity    auto-encoder
Received: 2018-02-05    Published: 2019-03-04
CLC:  TP 391  
Corresponding author: Zhe WANG     E-mail: zhaott16@mails.jlu.edu.cn; wz2000@jlu.edu.cn
About the first author: Ting-ting ZHAO (1993—), female, master's student, engaged in research on network representation learning, data mining and social networks. orcid.org/0000-0002-6870-6128. E-mail: zhaott16@mails.jlu.edu.cn

Cite this article:

Ting-ting ZHAO, Zhe WANG, Yi-nan LU. Heterogeneous information network representation learning based on transition probability matrix (HINtpm). Journal of Zhejiang University (Engineering Science), 2019, 53(3): 548-554.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2019.03.016        http://www.zjujournals.com/eng/CN/Y2019/V53/I3/548

Fig. 1  Example of a typical heterogeneous information network (HIN): DBLP
Fig. 2  Meta schema of the DBLP dataset
Fig. 3  Meta-path instance in DBLP: A-P-C-P-A
Fig. 4  First-order and second-order similarity in a homogeneous information network
Fig. 5  Denoising auto-encoder model (see the sketch below)
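To complement Fig. 5, the following is a minimal sketch of the dimensionality-reduction and evaluation pipeline described in the abstract, assuming Keras and scikit-learn; encode_nodes and evaluate are hypothetical helpers, and all hyper-parameters (embedding size, noise level, epochs, the k-means clustering step used before NMI) are assumptions rather than the authors' settings.

```python
import tensorflow as tf
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, normalized_mutual_info_score
from sklearn.manifold import TSNE

def encode_nodes(P, dim=128, noise=0.1, epochs=50):
    """Compress each row of the propagation matrix P into a low-dimensional
    node embedding with a denoising auto-encoder (cf. Fig. 5).
    Layer sizes, noise level and epochs are assumptions."""
    n = P.shape[1]
    inp = tf.keras.Input(shape=(n,))
    corrupted = tf.keras.layers.GaussianNoise(noise)(inp)         # corrupt input
    code = tf.keras.layers.Dense(dim, activation="relu")(corrupted)
    out = tf.keras.layers.Dense(n, activation="sigmoid")(code)    # reconstruct
    auto = tf.keras.Model(inp, out)
    auto.compile(optimizer="adam", loss="mse")
    auto.fit(P, P, epochs=epochs, verbose=0)                      # denoise P back to P
    return tf.keras.Model(inp, code).predict(P)                   # node embeddings

def evaluate(embeddings, labels, train_ratio=0.5):
    """GBDT classification accuracy, k-means clustering NMI, and t-SNE coordinates."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, train_size=train_ratio, stratify=labels)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    pred = KMeans(n_clusters=len(set(labels))).fit_predict(embeddings)
    nmi = normalized_mutual_info_score(labels, pred)
    xy = TSNE(n_components=2).fit_transform(embeddings)           # 2-D visualization
    return acc, nmi, xy
```

Sweeping train_ratio from 10% to 90% in this sketch would mirror the layout of Tables 1 and 2, with the NMI score corresponding to Table 3 and the t-SNE coordinates to the visualization in Fig. 6.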
Algorithm      R=10%   R=20%   R=30%   R=40%   R=50%   R=60%   R=70%   R=80%   R=90%
DeepWalk       0.6550  0.7139  0.7548  0.7439  0.7807  0.7680  0.7884  0.7956  0.7862
node2vec       0.6535  0.7355  0.7644  0.7727  0.7831  0.7617  0.7815  0.7862  0.8176
PathSim        0.7902  0.8168  0.8275  0.8267  0.8245  0.8331  0.8334  0.8302  0.8881
metapath2vec   0.8662  0.8969  0.8884  0.8960  0.8976  0.9071  0.9118  0.9201  0.8805
HINtpm         0.8936  0.9143  0.9267  0.9298  0.9299  0.9371  0.9496  0.9443  0.9415
HINtpm0        0.8973  0.9253  0.9353  0.9261  0.9381  0.9450  0.9452  0.9390  0.9686
Table 1  Classification accuracy of different algorithms on the DBLP dataset
Algorithm      R=10%   R=20%   R=30%    R=40%   R=50%   R=60%   R=70%   R=80%   R=90%
DeepWalk       0.8584  0.8730  0.8885   0.8984  0.9074  0.9106  0.9152  0.9100  0.9025
node2vec       0.8886  0.9146  0.9230   0.9304  0.9404  0.9446  0.9437  0.9328  0.9350
PathSim        0.9544  0.9658  0.99717  0.9715  0.9721  0.9748  0.9792  0.9800  0.9825
metapath2vec   0.8985  0.9285  0.9352   0.9376  0.9352  0.9399  0.9405  0.9539  0.9440
HINtpm         0.9475  0.9614  0.9603   0.9637  0.9610  0.9638  0.9697  0.9754  0.9843
HINtpm0        0.9636  0.9676  0.9697   0.9768  0.9767  0.9756  0.9835  0.9859  0.9850
Table 2  Classification accuracy of different algorithms on the AMiner dataset
Algorithm      NMI-DBLP  NMI-AMiner
DeepWalk       0.4834    0.7818
node2vec       0.4990    0.8315
metapath2vec   0.5707    0.5958
PathSim        0.3341    0.6004
HINtpm         0.6129    0.8317
HINtpm0        0.6086    0.8020
Table 3  Normalized mutual information (NMI) of different algorithms on the DBLP and AMiner datasets
Fig. 6  Visualization of different algorithms on the AMiner dataset
1 TU Cun-chao, YANG Cheng, LIU Zhi-yuan, et al. Network representation learning: an overview [J]. Scientia Sinica Informationis, 2017, 47 (8): 980-996
2 ROWEIS S T, SAUL L K Nonlinear dimensionality reduction by locally linear embedding[J]. Science, 2000, 290 (5500): 2323- 2326
doi: 10.1126/science.290.5500.2323
3 COX T F, COX M A A. Multidimensional scaling [M]. Boca Raton: CRC Press, 2000: 123–141.
4 BELKIN M, NIYOGI P. Laplacian eigenmaps and spectral techniques for embedding and clustering [C] // Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge: MIT Press, 2001: 585–591.
5 CHEN M, YANG Q, TANG X O. Directed graph embedding [C] // Proceedings of the 20th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers Inc, 2007: 2707–2712.
6 NALLAPATI R, COHEN W W. Link-PLSA-LDA: a new unsupervised model for topics and influence of blogs [C] // Proceedings of the 2nd International Conference on Weblogs and Social Media. Seattle: AAAI, 2008: 84–92.
7 CHANG J, BLEI D M. Relational topic models for document networks [C] // Proceedings of the 12th International Conference on Artificial Intelligence and Statistics. Cambridge: JMLR, 2009: 81–88.
8 LE T M V, LAUW H W. Probabilistic latent document network embedding [C] // Proceedings of the 2014 International Conference on Data Mining. Washington: IEEE Computer Society, 2014: 270–279.
9 WARD C K. Word2Vec [J]. Natural Language Engineering, 2016, 23 (1): 155-162
doi: 10.1017/S1351324916000334
10 TANG J, QU M, WANG M, et al. Line: large-scale information network embedding [C] // Proceedings of the 24th International Conference on World Wide Web. Florence: WWW, 2015: 1067–1077
11 WANG D, CUI P, ZHU W. Structural deep network embedding [C] // Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco: ACM, 2016: 1225–1234
12 TANG J, QU M, MEI Q. PTE: predictive text embedding through large-scale heterogeneous text networks [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney: ACM, 2015: 1165–1174.
13 CAO S S, LU W, XU Q K. GraRep: learning graph representations with global structural information [C] // ACM International on Conference on Information and Knowledge Management. Melbourne: ACM, 2015: 891–900.
14 YANG J, LESKOVEC J. Overlapping community detection at scale: a nonnegative matrix factorization approach [C] // Proceedings of the 6th ACM International Conference on Web Search and Data Mining. Rome: ACM, 2013: 587–596
15 LEY M. The DBLP Computer science bibliography: evolution, research issues, perspectives [C] // String Processing and Information Retrieval, International Symposium. Portugal: SPIRE, 2002: 1–10.
16 TANG J, ZHANG J, YAO L M, et al. ArnetMiner: extraction and mining of academic social networks [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas: SIGKDD, 2008: 990–998.
17 CHANG S Y, HAN W, TANG J L, et al. Heterogeneous network embedding via deep architectures [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney: ACM, 2015: 119–128.
18 JI M, HAN J W, DANILEVSKY M. Ranking-based classification of heterogeneous information networks [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego: ACM, 2011: 1298–1306.
19 DONG Y, CHAWLA N V, SWAMI A. Metapath2vec: Scalable representation learning for heterogeneous networks [C] // Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Nova Scotia: ACM, 2017: 135–144.
20 PEROZZI B, AL-RFOU R, SKIENA S. DeepWalk: online learning of social representations [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014: 701–710.
21 GROVER A, LESKOVEC J. node2vec: scalable feature learning for networks [C] // ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco: SIGKDD, 2016: 855–864.