Text-to-image generation algorithm based on generative adversarial network and coordinate attention mechanism
Yunhong LI, Qiqi ZHANG, Jinni CHEN, Weichong CHEN, Xueping SU, Chengming LIANG
School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710048, China
Abstract A text-to-image generation algorithm based on a coordinate attention mechanism and a generative adversarial network (CAT-GAN) was proposed to address the poor diversity and low overall quality of images generated by adversarial networks. Conditioning augmentation was used to calculate the mean and covariance matrix of the text feature vector, generating conditional variables that replace the original high-dimensional text features and resolve their sparsity. The coordinate attention mechanism was introduced into the residual blocks of the generator network to form a deep fusion module combined with coordinate attention (CA-Block), which captures long-range dependencies between feature channels while retaining precise positional information of features and enhancing the representation of target objects. A spatial reconstruction unit was introduced into the discriminator network to form a feature space reconstruction module (SRU-Block), which separates and reconstructs redundant features via weight assignment, strengthening the discriminator's feature representation ability. The model was tested and verified on the CUB-200, Oxford-102 Flowers, and COCO datasets. Experimental results showed that the proposed CAT-GAN achieved the best IS and FID scores among the compared models, including StackGAN++, AttnGAN, DAE-GAN, DM-GAN, DT-GAN, and DF-GAN: the IS reached 5.13, 4.10, and 31.81, and the FID reached 14.34, 16.76, and 26.36 on the three datasets, respectively. The proposed model also produced better visual results, demonstrating the effectiveness of the method.
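The two mechanisms named in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: it assumes a diagonal covariance for the conditioning distribution and replaces the learned projection layers and 1x1 convolutions of conditioning augmentation and coordinate attention with random or identity stand-ins, keeping only the data flow (reparameterized sampling of a conditioning variable; direction-aware pooling followed by per-axis gating).

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioning_augmentation(text_embedding, cond_dim, rng):
    """Sample a low-dimensional conditioning variable from a Gaussian whose
    mean and (diagonal) covariance are derived from the text embedding.
    The random matrices below are stand-ins for learned layers."""
    d = text_embedding.shape[0]
    W_mu = rng.standard_normal((cond_dim, d)) * 0.01      # hypothetical learned projection
    W_logvar = rng.standard_normal((cond_dim, d)) * 0.01  # hypothetical learned projection
    mu = W_mu @ text_embedding
    logvar = W_logvar @ text_embedding
    eps = rng.standard_normal(cond_dim)
    c = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    return c

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x):
    """Toy coordinate attention over a feature map x of shape (C, H, W):
    pool along each spatial axis separately, gate per axis, and reweight.
    The learned transform between pooling and gating is omitted."""
    pooled_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): pool along width
    pooled_w = x.mean(axis=1, keepdims=True)  # (C, 1, W): pool along height
    a_h = sigmoid(pooled_h)                   # attention weights along height
    a_w = sigmoid(pooled_w)                   # attention weights along width
    return x * a_h * a_w                      # position-aware reweighting, same shape

text_emb = rng.standard_normal(1024)          # e.g. a 1024-d sentence embedding
c = conditioning_augmentation(text_emb, 128, rng)
print(c.shape)  # (128,)
```

Because the two pooled vectors keep one spatial axis each, the product `a_h * a_w` assigns a distinct weight to every (h, w) position, which is how coordinate attention retains positional information that global pooling would discard.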
Received: 15 July 2025
Published: 06 May 2026
Fund: National Natural Science Foundation of China Youth Fund (62403368); Key Project of the Natural Science Basic Research Program of Shaanxi Province (2022JZ-35); Natural Science Basic Research Program of Shaanxi Province (2024JCYBMS-455); Youth Innovation Team Program of Shaanxi Universities; Xi'an "Scientist + Engineer" Team Project (25KGYB00029).
Keywords: text-to-image generation; generative adversarial network (GAN); conditioning augmentation; coordinate attention mechanism; affine transformation
[1] CAO Yin, QIN Junping, MA Qianli, et al. Survey of text-to-image synthesis [J]. Journal of Zhejiang University: Engineering Science, 2024, 58(2): 219-238.
[2] LI Yunhong, ZHU Mianyun, REN Jie, et al. Text-to-image synthesis based on modified deep convolutional generative adversarial network [J]. Journal of Beijing University of Aeronautics and Astronautics, 2023, 49(8): 1875-1883. doi: 10.13700/j.bh.1001-5965.2021.0588
[3] LIANG Chengming, LI Yunhong, LI Limin, et al. A semantic segmentation graph in combination with attention mechanism text generation images [J]. Journal of Air Force Engineering University, 2024, 25(4): 118-127. doi: 10.3969/j.issn.2097-1915.2024.04.016
[4] LI Feng, WEN Yimin. Multi-scale visual and textual semantic feature fusion for image captioning [J]. Journal of Shandong University: Engineering Science, 2025, 55(3): 80-87. doi: 10.6040/j.issn.1672-3961.0.2024.018
[5] ZHOU Gang, LI Handong, CHEN Yeye. Text-to-image generation based on contrastive learning [J]. Software Engineering, 2025, 28(2): 37-41.
[6] ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5908-5916.
[7] XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324.
[8] TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16494-16504.
[9] YE S, WANG H, TAN M, et al. Recurrent affine transformation for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2024, 26: 462-473. doi: 10.1109/TMM.2023.3266607
[10] HÖLLEIN L, BOŽIČ A, MÜLLER N, et al. ViewDiff: 3D-consistent image generation with text-to-image models [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5043-5052.
[11] SHIRAKAWA T, UCHIDA S. NoiseCollage: a layout-aware text-to-image diffusion model based on noise cropping and merging [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 8921-8930.
[12] GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs [C]//Advances in Neural Information Processing Systems. Long Beach: Curran Associates, Inc., 2017.
[13] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]//Advances in Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2672-2680.
[14] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset [R]. Pasadena: California Institute of Technology, 2011.
[15] NILSBACK M E, ZISSERMAN A. Automated flower classification over a large number of classes [C]//Proceedings of the 6th Indian Conference on Computer Vision, Graphics and Image Processing. Bhubaneswar: IEEE, 2008: 722-729.
[16] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]//Advances in Neural Information Processing Systems. Barcelona: Curran Associates, Inc., 2016: 2234-2242.
[17] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]//Advances in Neural Information Processing Systems. Long Beach: Curran Associates, Inc., 2017.
[18] ZHU M, PAN P, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5795-5803.
[19] ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962. doi: 10.1109/TPAMI.2018.2856256
[20] RUAN S, ZHANG Y, ZHANG K, et al. DAE-GAN: dynamic aspect-aware GAN for text-to-image synthesis [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 13940-13949.