J4, 2010, Vol. 44, Issue 4: 632-638. DOI: 10.3785/j.issn.1008-973X.2010.04.002
Instruction recycling based low power branch folding
MENG Jianyi, YAN Xiaolang, GE Haitong, XU Hongming
Institute of VLSI Design, Zhejiang University, Hangzhou 310027, Zhejiang, China

Based on an analysis of the basic characteristics of branches, this work proposes a high-performance, low-power branch folding method built on instruction recycling. The instruction recycling buffer shares the hardware of the existing instruction buffer. Loop instructions that have already been executed are frozen in the instruction buffer and issued directly into the pipeline when branch folding occurs. As a result, branch delay in the pipeline is reduced and instruction cache accesses are eliminated during branch folding. An adaptive instruction recycling window improves the utilization of the instruction buffer and satisfies the requirements of instruction buffering and instruction recycling at the same time. The dead zone of branch prediction is identified, and all branch prediction logic is disabled when speculative branch folding falls into that dead zone. Experimental results show that the method improves performance by 5.03% and reduces instruction fetch power by 22.10% with little hardware overhead.
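
The effect on instruction cache traffic can be illustrated with a toy model. The sketch below is not the authors' design: the function name `count_icache_fetches`, its parameters, and the rule "freeze the loop body after its first pass if it fits in the buffer" are assumptions made only for illustration. It does not model the adaptive recycling window, the prediction dead zone, or the reported 5.03% and 22.10% figures.

```python
def count_icache_fetches(loop_len, iterations, buffer_size, recycling):
    """Count instruction cache fetches for a single loop.

    Hypothetical model: the loop has `loop_len` instructions (the last one
    is a backward branch) and runs `iterations` times. With recycling on,
    a loop body that fits in the instruction buffer is frozen after its
    first pass and replayed from the buffer afterwards.
    """
    fetches = 0
    frozen = False  # True once the loop body is held in the buffer
    for _ in range(iterations):
        if frozen:
            # Body is replayed from the buffer: no cache access needed.
            continue
        fetches += loop_len
        # The backward branch has now been seen once; freeze the body
        # only if recycling is enabled and the body fits in the buffer.
        if recycling and loop_len <= buffer_size:
            frozen = True
    return fetches


if __name__ == "__main__":
    baseline = count_icache_fetches(4, 10, 8, recycling=False)
    recycled = count_icache_fetches(4, 10, 8, recycling=True)
    print(baseline, recycled)
```

In this model a 4-instruction loop run 10 times costs 40 cache fetches without recycling and 4 with it, while a loop larger than the buffer gains nothing. That second case is the kind of situation a fixed-size window handles poorly.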

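Published: 2010-05-14
CLC number: TN47
Corresponding author: YAN Xiaolang, male, professor, doctoral supervisor. E-mail:
About the author: MENG Jianyi (born 1982), male, from Shangyu, Zhejiang; PhD candidate; research focuses on the design of high-performance, low-power embedded processors. E-mail: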


MENG Jianyi, YAN Xiaolang, GE Haitong. Instruction recycling based low power branch folding [J]. J4, 2010, 44(4): 632-638.


