Journal of Zhejiang University (Engineering Science)
Computer Science and Technology
Research overview of big data technology
LIU Zhi-hui, ZHANG Quan-ling
Institute of Cyber-systems and Control, Zhejiang University, Hangzhou 310027, China
Abstract:

The emergence of big data has brought new challenges to massive-scale information processing technology. To give a fuller and deeper understanding of big data, this overview elaborates on it from three aspects: concepts and characteristics, the general processing workflow, and key techniques. The background of big data is analyzed, and its basic concepts, typical 4 “V” characteristics, and major application fields are outlined. The general workflow of big data processing is then summarized, and for the key techniques involved, such as MapReduce, GFS, BigTable, Hadoop, and data visualization, the basic processing procedures and organizational structures are introduced. Finally, the problems and challenges faced in the big data era are identified.

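Published: 2015-04-01
CLC number: TP 391
Funding:

Supported by the National Key Technology R&D Program of China during the 12th Five-Year Plan (2012BAF10B04).

Corresponding author: ZHANG Quan-ling, male, associate researcher. E-mail: qlzhang@iipc.zju.edu.cn
About the author: LIU Zhi-hui (1989-), female, master's student, working on big data processing technology. E-mail: zhihui891126@163.com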

Cite this article:

LIU Zhi-hui, ZHANG Quan-ling. Research overview of big data technology[J]. Journal of Zhejiang University (Engineering Science), 10.3785/j.issn.1008-973X.2014.06.001.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2014.06.001
http://www.zjujournals.com/eng/CN/Y2014/V48/I6/957

