J4  2010, Vol. 44 Issue (4): 632-638    DOI: 10.3785/j.issn.1008-973X.2010.04.002
    
Instruction recycling based low power branch folding
MENG Jianyi, YAN Xiaolang, GE Haitong, XU Hongming
Institute of VLSI Design, Zhejiang University, Hangzhou 310027, China

Abstract  

After analyzing the characteristics of loop branches, this work proposes a high-performance, low-power branch folding method based on instruction recycling. The instruction recycling buffer reuses the hardware of the instruction buffer. Expired instructions in loops are retained in the instruction buffer and sent directly into the pipeline when branch folding occurs. As a result, branch delays in the pipeline are reduced and instruction cache accesses are eliminated during branch folding. An adaptive instruction recycling window improves the utilization of the instruction buffer, so that the limited buffer meets the needs of both instruction buffering and instruction recycling. The dead zone of branch prediction is identified, and all branch prediction logic is disabled when speculative branch folding falls into this dead zone. Experimental results show that the method improves performance by 5.03% and reduces instruction fetch power by 22.10% with little hardware overhead.
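The recycling idea can be illustrated with a toy software model. This is only a sketch under simplifying assumptions, not the authors' hardware design: the class `RecyclingFetchModel`, the function `run_loop`, the FIFO replacement policy, and the fixed buffer capacity are all hypothetical, and the adaptive window and prediction dead zone are not modeled. It shows only that a loop body that fits in the buffer can be replayed without further instruction cache accesses.

```python
from collections import deque


class RecyclingFetchModel:
    """Toy fetch model (hypothetical): recently fetched instruction
    addresses stay in a small FIFO buffer and are reused on a hit."""

    def __init__(self, capacity):
        # Assumption: plain FIFO of fixed size; the paper's window is adaptive.
        self.buffer = deque(maxlen=capacity)
        self.icache_accesses = 0  # fetches that had to go to the instruction cache
        self.recycled = 0         # fetches served from the buffer

    def fetch(self, pc):
        if pc in self.buffer:
            # Instruction is still buffered: replay it, no cache access.
            self.recycled += 1
        else:
            # Not buffered: access the cache and keep a copy for reuse.
            self.icache_accesses += 1
            self.buffer.append(pc)


def run_loop(model, start, end, iterations):
    """Fetch addresses start..end (inclusive) `iterations` times, as a loop
    closed by a backward branch at `end`."""
    for _ in range(iterations):
        for pc in range(start, end + 1):
            model.fetch(pc)


if __name__ == "__main__":
    fits = RecyclingFetchModel(capacity=8)
    run_loop(fits, 100, 103, 10)  # 4-instruction loop, 10 iterations
    print(fits.icache_accesses, fits.recycled)

    too_small = RecyclingFetchModel(capacity=2)
    run_loop(too_small, 100, 103, 10)
    print(too_small.icache_accesses, too_small.recycled)
```

In this model, a loop that fits in the buffer touches the cache only during its first iteration, while a loop larger than the buffer gains nothing. This is one way to see why the size of the recycling window matters.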



Published: 14 May 2010
CLC:  TN47  
Cite this article:

MENG Jian-Yi, YAN Xiao-Lang, GE Hai-Tong. Instruction recycling based low power branch folding. J4, 2010, 44(4): 632-638.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2010.04.002     OR     http://www.zjujournals.com/eng/Y2010/V44/I4/632


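Low-power loop branch folding based on instruction recycling

Based on an analysis of the characteristics of loop branches, a high-performance, low-power loop branch folding method based on recycling expired instructions is proposed. The method implements the instruction recycling area by reusing the hardware resources of the instruction buffer. During loop branch folding, loop body instructions are sent directly from the recycling area into the pipeline, which reduces branch delay and eliminates instruction cache accesses. By adaptively adjusting the width of the recycling window, the limited hardware resources of the instruction buffer can meet the needs of both instruction buffering and instruction recycling. When speculative folding enters the prediction dead zone, the branch prediction memory is turned off, which reduces the dynamic power of speculative folding. Experimental data show that, compared with conventional loop branch folding, an embedded processor using this method improves overall performance by 5.03% on average and reduces the dynamic power of the instruction fetch unit by 22.10%.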

