Journal of Zhejiang University (Engineering Science)  2019, Vol. 53 Issue (12): 2348-2356    DOI: 10.3785/j.issn.1008-973X.2019.12.012
Computer Science and Artificial Intelligence
Near-data processing based on dynamic task offloading
Xing-cheng HUA, Peng LIU*
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
Abstract:

A near-data processing (NDP) approach based on dynamic task offloading was proposed to address the adverse effects on system performance and energy consumption incurred by data movements in big data applications. The approach leveraged the capability of 3D memory to integrate both memory and logic circuits, as well as the data parallelism of the MapReduce model. The workflow of MapReduce workloads was decoupled to extract the key computation tasks, and a task offloading mechanism was provided to migrate the computation tasks to NDP units dynamically. Atomic operations were employed to optimize memory accesses, thus reducing data movements dramatically. The experimental results demonstrate that, for MapReduce workloads, the proposed near-data processing approach restricts 75% of the data movements within the memory module, indicating that the data movements between the main memory and the host processors are significantly reduced. Compared with the state-of-the-art work, the proposed approach improved system performance and energy efficiency by 70% and 44%, respectively.

Key words: near-data processing (NDP)    MapReduce    3D memory    dynamic task offloading
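The abstract states that atomic operations are used to optimize memory accesses so that most data movement stays inside the memory module. A minimal sketch of the underlying intuition, under assumed sizes (the 64-byte cache line and 16-byte command packet are illustrative, not from the paper): a host-side read-modify-write moves a whole cache line across the off-chip link twice, while a memory-side atomic update moves only a small command packet.

```python
# Illustrative traffic model (all sizes are assumptions, not from the paper):
# a host-side increment fetches a cache line and writes it back, so the line
# crosses the host<->memory link twice; an NDP-side atomic add is performed
# next to DRAM, so only a small command packet crosses the link.

CACHE_LINE = 64   # bytes per host read or write (assumed)
CMD_PACKET = 16   # bytes per atomic-add command (assumed)

def host_increment_traffic(n_updates):
    """Link traffic when the host performs read-modify-write updates."""
    return n_updates * 2 * CACHE_LINE

def ndp_atomic_traffic(n_updates):
    """Link traffic when an NDP unit performs the updates as atomics."""
    return n_updates * CMD_PACKET

updates = 1_000_000
host = host_increment_traffic(updates)
ndp = ndp_atomic_traffic(updates)
print(f"host: {host} B, ndp: {ndp} B, reduction: {1 - ndp / host:.0%}")
```

Under these assumed sizes the off-chip traffic drops by 87.5%, which is the qualitative effect the paper exploits; the paper's measured 75% figure covers whole MapReduce workloads, not this toy model.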
Received: 2018-10-24  Published: 2019-12-17
CLC:  TP 302  
Supported by the Key Project of the NSFC-Guangdong Joint Fund (U1401253), the National Natural Science Foundation of China (61573153), the Special Fund for Applied Science and Technology Research and Development of Guangdong Province (2016B020243011, 2016B090927007), and the Natural Science Foundation of Guangdong Province (2016A030313510)
Corresponding author: Peng LIU  E-mail: hua2009x@zju.edu.cn; liupeng@zju.edu.cn
About the author: Xing-cheng HUA (1990—), male, Ph.D. candidate, researching computer architecture. orcid.org/0000-0003-2428-5787. E-mail: hua2009x@zju.edu.cn
Cite this article:

Xing-cheng HUA, Peng LIU. Near-data processing based on dynamic task offloading. Journal of Zhejiang University (Engineering Science), 2019, 53(12): 2348-2356.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2019.12.012        http://www.zjujournals.com/eng/CN/Y2019/V53/I12/2348

Fig. 1  Structure of the near-data processing (NDP) module
Fig. 2  Workflow of MapReduce applications in the NDP system
Level           Function              Description
High-level API  split                 Partition the input data into splits
                shuffle               Pass the output data addresses of Map tasks to Reduce tasks
                map                   User-defined Map method
                reduce                User-defined Reduce method
                map_worker            Invoke an NDP unit to execute the user-defined Map method
                reduce_worker         Invoke an NDP unit to execute the user-defined Reduce method
Low-level API   offload_kernel        Write a computation kernel into an NDP unit
                offload_data          Pass data to an NDP unit
                start_computation     Start a computation task
                wait_for_completion   Wait for a computation task to complete
                write_reg             Write data to a register of an NDP unit
                read_reg              Read data from a register of an NDP unit
Table 1  Application programming interface (API) of the MapReduce framework
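Table 1 suggests that the high-level API is built on top of the low-level API. A minimal software sketch of that layering, with the NDP unit mocked in Python — the function names follow Table 1, but the bodies are hypothetical, since the paper's actual runtime and hardware are not shown on this page:

```python
# Hypothetical sketch of Table 1's layering: map_worker (high-level API)
# composed from the four low-level calls. The NDP unit is a software mock.

class MockNDPUnit:
    """Stands in for one memory-side processing unit."""
    def __init__(self):
        self.kernel = None
        self.data = None
        self.result = None

    def offload_kernel(self, kernel):   # write the computation kernel into the unit
        self.kernel = kernel

    def offload_data(self, data):       # pass the input data to the unit
        self.data = data

    def start_computation(self):        # launch the computation task
        self.result = self.kernel(self.data)

    def wait_for_completion(self):      # block until the task completes (immediate here)
        return self.result

def map_worker(unit, map_fn, split):
    """High-level API: run a user-defined Map method on one NDP unit."""
    unit.offload_kernel(map_fn)
    unit.offload_data(split)
    unit.start_computation()
    return unit.wait_for_completion()

# Usage: a word-count Map task over one input split.
split = ["a", "b", "a"]
pairs = map_worker(MockNDPUnit(), lambda ws: [(w, 1) for w in ws], split)
print(pairs)  # [('a', 1), ('b', 1), ('a', 1)]
```

reduce_worker would follow the same four-step pattern with the user-defined Reduce method and the addresses delivered by shuffle.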
Fig. 3  Execution flow of the resident program in an NDP unit
Fig. 4  Example of kernel extraction and offloading
Module            Configuration
Host processor    2x ARM Cortex-A15 cores
                  32 KB L1 I-Cache, 64 KB L1 D-Cache
                  2 MB L2 Cache
NDP unit          1x ARM Cortex-A15 core
                  1 MB SPM, 16-entry TLB
                  DMA: burst size 256 bytes
3D memory module  4 layers, 16 vaults, 512 MB
                  Timing (ns): tRP = 13.75, tRCD = 13.75,
                  tCL = 13.75, tWR = 15, tRAS = 27.5,
                  tCK = 0.8, tCCD = 5, tBURST = 3.2
Table 2  Configuration of the near-data processing (NDP) system
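As a rough sanity check on the DRAM timings in Table 2: a read to a closed page must first activate the row (tRCD), then issue the column read (tCL), then stream out the burst (tBURST). The simple sum below ignores controller queueing and bank conflicts and is only illustrative, not a figure from the paper:

```python
# Rough closed-page read latency from Table 2's timing parameters (ns).
# Activate the row (tRCD), issue the column read (tCL), stream the burst
# (tBURST); queueing and bank conflicts are ignored.

tRCD, tCL, tBURST = 13.75, 13.75, 3.2   # ns, from Table 2

closed_page_read_ns = tRCD + tCL + tBURST
print(f"{closed_page_read_ns:.2f} ns")
```

The result, roughly 31 ns per closed-page access, is the kind of per-access cost that memory-side processing amortizes by avoiding round trips to the host.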
Fig. 5  Distribution of data movements in the NDP system
Fig. 6  Energy consumption of each module in the Host and NMR systems
Fig. 7  Speedup of Map tasks and Reduce tasks in the NMR system
Fig. 8  Performance comparison of the Host, NDC, and NMR systems
Fig. 9  Effect of the number of NDP units on system performance