Framework of feature fusion and distribution with mixture of experts for parallel recommendation algorithm
Zhe YANG 1,2, Hong-wei GE 1,2,*, Ting LI 1,2
1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China
A mixture-of-experts parallel recommendation algorithm framework combining feature fusion and distribution (ME-PRAF) was proposed to address the lack of parameter sharing and the high computational cost in click-through rate prediction. The framework improves the ability of parallel architectures to distinguish different types of features and to learn more expressive feature inputs, and it allows parameters to be shared between the explicit and implicit feature-interaction branches. Gradient problems during backpropagation were alleviated and model performance was improved. The framework is lightweight and model-agnostic, and can be generalized to a variety of mainstream parallel recommendation algorithms. Extensive experimental results on three public datasets demonstrate that the framework effectively improves the performance of SOTA models.
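The abstract describes ME-PRAF only at a high level. As a rough, non-authoritative sketch of the underlying idea, the following PyTorch snippet shows how a mixture-of-experts layer might fuse the outputs of the explicit (cross) and implicit (deep) branches and distribute the mixed signal back to each branch; all module names, dimensions, and the gating scheme here are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class FusionDistribution(nn.Module):
    """Illustrative mixture-of-experts bridge between two parallel branches.

    Fuses the explicit (cross) and implicit (deep) branch outputs, then lets
    per-branch softmax gates decide how much of each expert's output flows
    back to each branch. A sketch of the general idea only, not the exact
    ME-PRAF Broker/Fusion design described in the paper.
    """

    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            for _ in range(n_experts)
        )
        # One gate per receiving branch (explicit and implicit).
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, n_experts), nn.Softmax(dim=-1))
            for _ in range(2)
        )

    def forward(self, x_explicit: torch.Tensor, x_implicit: torch.Tensor):
        fused = torch.cat([x_explicit, x_implicit], dim=-1)            # feature fusion
        expert_out = torch.stack([e(fused) for e in self.experts], 1)  # (B, E, D)
        redistributed = []
        for gate in self.gates:                                        # feature distribution
            w = gate(fused).unsqueeze(-1)                              # (B, E, 1)
            redistributed.append((w * expert_out).sum(dim=1))          # weighted expert mix
        # Residual connections keep each branch's own signal intact.
        return x_explicit + redistributed[0], x_implicit + redistributed[1]
```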
Zhe YANG, Hong-wei GE, Ting LI. Framework of feature fusion and distribution with mixture of experts for parallel recommendation algorithm [J]. Journal of ZheJiang University (Engineering Science), 2023, 57(7): 1317-1325.
Fig.1 Illustration of sequential and parallel architectures
Fig.2 Overall architecture of ME-PRAF
Fig.3 Internal structure of the Broker module
Dataset        M/10^6   F    C/10^6
Criteo         45       39   33
Avazu          40       23   9.4
MovieLens-1M   0.74     7    0.013

Tab.1 Parameters of three datasets in experiment (M: number of samples; F: number of feature fields; C: number of features)
Model      Criteo            Avazu             MovieLens-1M
           AUC      LogLoss  AUC      LogLoss  AUC      LogLoss
DeepFM     0.8007   0.4508   0.7852   0.3780   0.8932   0.3202
DCN        0.8099   0.4419   0.7905   0.3744   0.8935   0.3197
xDeepFM    0.8052   0.4418   0.7894   0.3794   0.8923   0.3251
AutoInt+   0.8083   0.4434   0.7774   0.3811   0.8488   0.3753
DCN-v2     0.8115   0.4406   0.7907   0.3742   0.8964   0.3160
EDCN       0.8001   0.5415   0.7793   0.3803   0.8722   0.3469
CowClip    0.8097   0.4420   0.7906   0.3740   0.8961   0.3174
Proposed   0.8122   0.4398   0.7928   0.3732   0.8970   0.3163

Tab.2 Performance comparisons between ME-DCN and other SOTA models on three datasets
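AUC and LogLoss in Tab.2 are the standard binary-classification metrics for CTR prediction (higher AUC and lower LogLoss are better). A minimal example of computing them, here with scikit-learn as an assumed tool (the paper does not state its evaluation code):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# y_true: 0/1 click labels; y_pred: predicted click probabilities.
y_true = np.array([1, 0, 0, 1, 1])
y_pred = np.array([0.82, 0.31, 0.15, 0.67, 0.55])

auc = roc_auc_score(y_true, y_pred)   # higher is better
ll = log_loss(y_true, y_pred)         # lower is better
print(f"AUC={auc:.4f}  LogLoss={ll:.4f}")
```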
Model      N_p/10^6
DeepFM     1.4
DCN        3.1
xDeepFM    4.2
AutoInt+   3.7
DCN-v2     7.2
EDCN       11
Proposed   5.7

Tab.3 Comparison of parameter counts between ME-DCN and other models (Criteo)
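As a side note on how the parameter counts N_p in Tab.3 are typically obtained: assuming a PyTorch implementation (the paper does not state its framework), the trainable-parameter count is usually measured as follows.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameter count in millions, matching the N_p/10^6 unit of Tab.3."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```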
Model        Criteo            Avazu             MovieLens-1M
             AUC      LogLoss  AUC      LogLoss  AUC      LogLoss
DCN          0.8099   0.4419   0.7905   0.3744   0.8935   0.3197
DCNME        0.8116   0.4403   0.7919   0.3731   0.8962   0.3174
AutoInt+     0.8083   0.4434   0.7774   0.3811   0.8488   0.3753
AutoInt+ME   0.8104   0.4414   0.7899   0.3737   0.8928   0.3250
DCN-v2       0.8115   0.4406   0.7907   0.3742   0.8964   0.3160
DCN-v2ME     0.8122   0.4398   0.7928   0.3732   0.8970   0.3163

Tab.4 Performance comparison of SOTA parallel architecture models after using ME-PRAF on three datasets
Model           AUC      LogLoss
ME-DCN w/o FB   0.8117   0.4403
ME-DCN w/o EB   0.8113   0.4407
ME-DCN          0.8122   0.4398

Tab.5 Ablation study of Broker modules in ME-DCN (Criteo)
Model              AUC      LogLoss
ME-DCN w/ concat   0.8122   0.4398
ME-DCN w/ add      0.8105   0.4418
ME-DCN w/ Hadamard 0.8107   0.4416

Tab.6 Performance comparison of various fusion types in the Fusion module of ME-DCN (Criteo)
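The three variants in Tab.6 correspond to concatenation, element-wise addition, and the Hadamard (element-wise) product of the two branch outputs. In PyTorch terms, assuming branch outputs of equal width, the three fusion types amount to:

```python
import torch

x_explicit = torch.randn(32, 64)  # explicit-branch output (batch, dim)
x_implicit = torch.randn(32, 64)  # implicit-branch output (batch, dim)

fused_concat = torch.cat([x_explicit, x_implicit], dim=-1)  # w/ concat: (32, 128)
fused_add = x_explicit + x_implicit                         # w/ add: (32, 64)
fused_hadamard = x_explicit * x_implicit                    # w/ Hadamard: (32, 64)
```

The LogLoss gap in Tab.6 (0.4398 for concat vs 0.4418/0.4416 for add/Hadamard) suggests that concatenation, which preserves both signals rather than collapsing them into one vector, works best for ME-DCN.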
Fig.4 Analysis of the diversity factor of feature weights in the Broker module
[1] KHAWAR F, HANG X, TANG R, et al. AutoFeature: searching for feature interactions and their architectures for click-through rate prediction [C]// Proceedings of the 29th ACM International Conference on Information and Knowledge Management. [S.l.]: ACM, 2020: 625-634.
[2] HU D, WANG C, NIE F, et al. Dense multimodal fusion for hierarchically joint representation [C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 3941-3945.
[3] CHENG H T, KOC L, HARMSEN J, et al. Wide and deep learning for recommender systems [C]// Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. Boston: ACM, 2016: 7-10.
[4] RENDLE S. Factorization machines [C]// IEEE International Conference on Data Mining. Sydney: IEEE, 2010: 995-1000.
[5] GUO H, TANG R, YE Y, et al. DeepFM: a factorization-machine based neural network for CTR prediction [C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne: AAAI, 2017: 1725-1731.
[6] WANG R, SHIVANNA R, CHENG D, et al. DCN v2: improved deep & cross network and practical lessons for web-scale learning to rank systems [C]// Proceedings of the Web Conference. Ljubljana: ACM, 2021: 1785-1797.
[7] BEUTEL A, COVINGTON P, JAIN S, et al. Latent cross: making use of context in recurrent recommender systems [C]// Proceedings of the 11th ACM International Conference on Web Search and Data Mining. Marina Del Rey: ACM, 2018: 46-54.
[8] QU Y, FANG B, ZHANG W, et al. Product-based neural networks for user response prediction over multi-field categorical data [J]. ACM Transactions on Information Systems, 2018, 37(1): 1-35.
[9] ZHOU G, ZHU X, SONG C, et al. Deep interest network for click-through rate prediction [C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London: ACM, 2018: 1059-1068.
[10] ZHOU G, MOU N, FAN Y, et al. Deep interest evolution network for click-through rate prediction [C]// Proceedings of the AAAI Conference on Artificial Intelligence. California: AAAI, 2019: 5941-5948.
[11] WANG R, FU B, FU G, et al. Deep & cross network for ad click predictions [C]// Proceedings of the ADKDD'17. Halifax: ACM, 2017: 1-7.
[12] SONG W, SHI C, XIAO Z, et al. AutoInt: automatic feature interaction learning via self-attentive neural networks [C]// Proceedings of the 28th ACM International Conference on Information and Knowledge Management. Beijing: ACM, 2019: 1161-1170.
[13] LIAN J, ZHOU X, ZHANG F, et al. xDeepFM: combining explicit and implicit feature interactions for recommender systems [C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London: ACM, 2018: 1754-1763.
[14] HUANG T, SHE Q, WANG Z, et al. GateNet: gating-enhanced deep network for click-through rate prediction [J]. arXiv, 2020: 1-7.
[15] CHEN B, WANG Y, LIU Z, et al. Enhancing explicit and implicit feature interactions via information sharing for parallel deep CTR models [C]// Proceedings of the 30th ACM International Conference on Information and Knowledge Management. Queensland: ACM, 2021: 3757-3766.
[16] MA J, ZHAO Z, YI X, et al. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts [C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London: ACM, 2018: 1930-1939.
[17] HOLMES N P, SPENCE C. Multisensory integration: space, time and superadditivity [J]. Current Biology, 2005, 15(18): R762-R764. doi: 10.1016/j.cub.2005.08.058
[18] ZHENG Z, XU P, ZOU X, et al. CowClip: reducing CTR prediction model training time from 12 hours to 10 minutes on 1 GPU [J]. arXiv, 2022: 1-18.
[19] KINGMA D P, BA J. Adam: a method for stochastic optimization [J]. arXiv, 2014: 1-13.
[20] HE K, ZHANG X, REN S, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1026-1034.