Journal of ZheJiang University (Engineering Science)  2022, Vol. 56 Issue (5): 1006-1016    DOI: 10.3785/j.issn.1008-973X.2022.05.018
    
Face reconstruction from voice based on age-supervised learning and face prior information
Li HE, Shan-min PANG*
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China

Abstract  

Previous voice-to-face image reconstruction methods lack effective supervisory constraints from different dimensions and do not exploit face prior information, which leads to low similarity between reconstructed and real images. To address this, a face reconstruction method based on age-supervised learning and face prior information was proposed. Age labels were added to the current dataset by a pre-trained age estimation model, compensating for the lack of age supervision. For a given voice sample, voice-face cross-modal identity matching was used to retrieve face images close to the real speaker, and the retrieved results served as face prior information. A joint loss function combining cross-entropy loss and adversarial loss was defined to improve the age consistency, low-frequency content and high-frequency texture of the reconstructed images. Face retrieval experiments on the Voxceleb 1 dataset showed that the proposed method improves the similarity between generated and ground-truth images, and that the generated images achieve better subjective and objective evaluation results than those of the compared methods.
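The joint objective described in the abstract (cross-entropy terms plus an adversarial term) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the weighting coefficients `w_id`, `w_age`, `w_adv` and the non-saturating form of the adversarial term are assumptions introduced here.

```python
import math

def cross_entropy(probs, target_idx):
    # Standard cross-entropy for one sample: -log p(target class).
    return -math.log(probs[target_idx])

def adversarial_g_loss(d_fake):
    # Non-saturating generator loss: -log D(G(z)),
    # where d_fake is the discriminator's score on the generated face.
    return -math.log(d_fake)

def joint_loss(id_probs, id_target, age_probs, age_target, d_fake,
               w_id=1.0, w_age=1.0, w_adv=1.0):
    # Weighted sum of identity cross-entropy, age cross-entropy,
    # and the adversarial term (weights are illustrative).
    return (w_id * cross_entropy(id_probs, id_target)
            + w_age * cross_entropy(age_probs, age_target)
            + w_adv * adversarial_g_loss(d_fake))
```

The age classifier supplies the extra supervisory dimension: a generated face that fools the discriminator but predicts the wrong age class still pays a cross-entropy penalty.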



Key words: deep learning; image reconstruction; convolutional neural network; generative adversarial network; face prior information
Received: 19 November 2021      Published: 31 May 2022
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61972312); General Industrial Project of the Key Research and Development Program of Shaanxi Province (2020GY-002)
Corresponding Authors: Shan-min PANG     E-mail: heliwushi@qq.com;pangsm@xjtu.edu.cn
Cite this article:

Li HE,Shan-min PANG. Face reconstruction from voice based on age-supervised learning and face prior information. Journal of ZheJiang University (Engineering Science), 2022, 56(5): 1006-1016.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2022.05.018     OR     https://www.zjujournals.com/eng/Y2022/V56/I5/1006


Fig.1 Illustration of face reconstruction from voice based on age-supervised learning and face prior information
Fig.2 Architecture of cross-modal matching network with cross-entropy loss
Fig.3 Architecture of speech encoder network
Fig.4 Architecture of generative adversarial network with auxiliary classifiers
Fig.5 Architecture of generative network
Fig.6 Architecture of discriminator module with identity classifier and age classifier
Fig.7 Architecture of image fusion network with skip connection
Model | Metric | ResNet-50 Top-1/Top-5/Top-10 (%) | VGG-16 Top-1/Top-5/Top-10 (%) | FID
random | – | 0.53 / 1.30 / 2.17 | 0.53 / 1.30 / 2.17 | –
Speech2Face[13] | L1 | 0.61 / 2.59 / 4.44 | 0.58 / 2.96 / 5.45 | 233.92
 | cos | 0.56 / 2.59 / 4.60 | 0.69 / 3.31 / 5.93 |
Voice2Face[15] | L1 | 1.88 / 5.21 / 8.47 | 1.30 / 6.06 / 10.79 | 51.45
 | cos | 1.98 / 5.58 / 8.33 | 1.32 / 5.66 / 10.90 |
Generation module only | L1 | 2.30 / 5.71 / 8.33 | 1.38 / 6.46 / 11.08 | 38.60
 | cos | 2.25 / 5.77 / 8.49 | 1.69 / 6.48 / 11.53 |
Proposed method | L1 | 2.59 / 5.98 / 9.20 | 1.75 / 6.60 / 11.60 | 40.32
 | cos | 2.32 / 5.81 / 9.17 | 1.71 / 6.56 / 11.58 |
Tab.1 Experimental results of proposed method compared with popular methods
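The Top-k figures in Tab.1 come from a face retrieval protocol: each reconstructed face is compared against a gallery of real faces, and a query counts as a hit if the ground-truth identity appears among the k nearest gallery images under the chosen distance (L1 or cosine). A minimal sketch of that scoring, with hypothetical function names (the paper's feature extractors and gallery construction are not reproduced here):

```python
def top_k_accuracy(distances, true_idx, k):
    # One query: rank gallery images by distance to the reconstructed
    # face, then check whether the ground-truth face is in the top k.
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return true_idx in ranked[:k]

def retrieval_accuracy(all_distances, all_targets, k):
    # Mean Top-k hit rate over a batch of queries.
    hits = [top_k_accuracy(d, t, k)
            for d, t in zip(all_distances, all_targets)]
    return sum(hits) / len(hits)
```

Reporting Top-1, Top-5 and Top-10 simply means evaluating `retrieval_accuracy` at k = 1, 5 and 10 on the same ranked lists.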
Fig.8 Experimental image details of proposed method compared with popular methods
Model | Metric | Top-1/% | Top-5/% | Top-10/% | FID
random | – | 0.53 | 1.30 | 2.17 | –
Without age data | L1 | 1.80 | 4.89 | 7.86 | 37.17
 | cos | 1.80 | 4.84 | 7.65 |
With age data | L1 | 2.30 | 5.71 | 8.33 | 38.60
 | cos | 2.25 | 5.77 | 8.49 |
Tab.2 Comparison of different age-supervision settings in ablation study
Fig.9 Comparison of images generated with different age-supervision settings
Model | Metric | Top-1/% | Top-5/% | Top-10/% | FID
random | – | 0.53 | 1.30 | 2.17 | –
Generation module only | L1 | 2.30 | 5.71 | 8.33 | 38.60
 | cos | 2.25 | 5.77 | 8.49 |
Without retrieval module | L1 | 1.67 | 4.55 | 7.25 | 54.87
 | cos | 1.77 | 4.63 | 7.28 |
Random retrieval module | L1 | 1.53 | 4.95 | 7.67 | 57.85
 | cos | 1.53 | 4.81 | 7.65 |
Full model | L1 | 2.59 | 5.98 | 9.20 | 40.32
 | cos | 2.32 | 5.81 | 9.17 |
Tab.3 Comparison of different retrieval-module settings in ablation study
Fig.10 Comparison of images generated with different retrieval-module settings
[1]   SUN Ying, HU Yan-xiang, ZHANG Xue-ying, et al. Prediction of emotional dimensions PAD for emotional speech recognition[J]. Journal of Zhejiang University: Engineering Science, 2019, 53(10): 2041-2048.
[2]   SINGH R, RAJ B, GENCAGA D. Forensic anthropometry from voice: an articulatory-phonetic approach [C]// 2016 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). Opatija: IEEE, 2016: 1375-1380.
[3]   LI Jiang, ZHAO Ya-qiong, BAO Ye-hua. Voice processing technique for patients with stroke based on chaos theory and surrogate data analysis[J]. Journal of Zhejiang University: Engineering Science, 2015, 49(1): 36-41.
[4]   BELIN P, FECTEAU S, BEDARD C. Thinking the voice: neural correlates of voice perception[J]. Trends in Cognitive Sciences, 2004, 8(3): 129-135.
[5]   KAMACHI M, HILL H, LANDER K, et al. 'Putting the face to the voice': matching identity across modality[J]. Current Biology, 2003, 13(19): 1709-1714.
[6]   GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Advances in Neural Information Processing Systems, 2014, 27: 2672-2680.
[7]   MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. (2014-11-06). https://arxiv.org/pdf/1411.1784.pdf.
[8]   YU Y, GONG Z, ZHONG P, et al. Unsupervised representation learning with deep convolutional neural network for remote sensing images [C]// International Conference on Image and Graphics. Shanghai: Springer, 2017: 97-108.
[9]   ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: IEEE, 2017: 1125-1134.
[10]   ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2223-2232.
[11]   WANG Kai, YUE Bo-xuan, FU Jun-wei, et al. Image restoration and fault tolerance of stereo SLAM based on generative adversarial net[J]. Journal of Zhejiang University: Engineering Science, 2019, 53(1): 115-125.
[12]   DUAN Ran, ZHOU Deng-wen, ZHAO Li-juan, et al. Image super-resolution reconstruction based on multi-scale feature mapping network[J]. Journal of Zhejiang University: Engineering Science, 2019, 53(7): 1331-1339.
[13]   OH T H, DEKEL T, KIM C, et al. Speech2face: learning the face behind a voice [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7539-7548.
[14]   DUARTE A C, ROLDAN F, TUBAU M, et al. WAV2PIX: speech-conditioned face generation using generative adversarial networks [C]// International Conference on Acoustics, Speech, and Signal Processing. Brighton: IEEE, 2019: 8633-8637.
[15]   WEN Y, RAJ B, SINGH R. Face reconstruction from voice using generative adversarial networks[J]. Advances in Neural Information Processing Systems, 2019, 32: 5265-5274.
[16]   ODENA A, OLAH C, SHLENS J. Conditional image synthesis with auxiliary classifier gans [C]// International Conference on Machine Learning. Sydney: ICML, 2017: 2642-2651.
[17]   CHOI H S, PARK C, LEE K. From inference to generation: end-to-end fully self-supervised generation of human face from speech [C]// International Conference on Learning Representations. Addis Ababa: ICLR, 2020.
[18]   RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// International Conference on Medical Image Computing and Computer Assisted Intervention. Munich: Springer, 2015: 234-241.
[19]   LI C, WAND M. Precomputed real-time texture synthesis with markovian generative adversarial networks [C]// European Conference on Computer Vision. Amsterdam: Springer, 2016: 702-716.
[20]   CHEN Y, TAI Y, LIU X, et al. FSRNet: end-to-end learning face super-resolution with facial priors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake: IEEE, 2018: 2492-2501.
[21]   ARANDJELOVIC R, ZISSERMAN A. Look, listen and learn [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 609-617.
[22]   CASTREJON L, AYTAR Y, VONDRICK C, et al. Learning aligned cross-modal representations from weakly aligned data [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 2940-2949.
[23]   YANG D W, ISMAIL M A, LIU W Y, et al. Disjoint mapping network for cross-modal matching of voices and faces [C]// International Conference on Learning Representations. Addis Ababa: ICLR, 2018.
[24]   ZHANG Xiao-bing, GONG Hai-gang, YANG Fan, et al. Chinese sentence-level lip reading based on end-to-end model[J]. Journal of Software, 2020, 31(6): 1747-1760.
[25]   HOOVER K, CHAUDHURI S, PANTOFARU C, et al. Putting a face to the voice: fusing audio and visual signals across a video to determine speakers [EB/OL]. (2017-5-31) [2021-10-24]. https://arxiv.org/pdf/1706.00079.pdf.
[26]   TANG Zhi, HOU Jin. Speech driven articulator motion synthesis with deep neural networks[J]. Acta Automatica Sinica, 2016, 42(6): 923-930.
[27]   SUN Y, ZHOU H, LIU Z, et al. Speech2Talking-Face: inferring and driving a face with synchronized audio-visual representation [C]// International Joint Conference on Artificial Intelligence. Montreal: IJCAI, 2021: 1018-1024.
[28]   ZHOU H, SUN Y, WU W, et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual: IEEE, 2021: 4176-4186.
[29]   NAGRANI A, ALBANIE S, ZISSERMAN A. Seeing voices and hearing faces: cross-modal biometric matching [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 8427-8436.
[30]   PARKHI O M, VEDALDI A, ZISSERMAN A. Deep face recognition [C]// British Machine Vision Conference. Swansea: BMVA Press, 2015: 41.1-41.12.
[31]   NAGRANI A, ALBANIE S, ZISSERMAN A. Seeing voices and hearing faces: cross-modal biometric matching [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 8427-8436.
[32]   ROTHE R, TIMOFTE R, VAN GOOL L. Deep expectation of real and apparent age from a single image without facial landmarks[J]. International Journal of Computer Vision, 2018, 126(2): 144-157.
[33]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[34]   KING D E. Dlib-ml: a machine learning toolkit[J]. The Journal of Machine Learning Research, 2009, 10: 1755-1758.
[35]   EPHRAT A, MOSSERI I, LANG O, et al. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation[J]. ACM Transactions on Graphics (TOG), 2018, 37(4): 1-11.