Scalable high performance de-duplication backup via hash join

doi:10.1631/jzus.C0910445

Front. Inform. Technol. Electron. Eng., 2010, Vol. 11, Issue 5: 315-327

Tian-ming Yang, Dan Feng^*, Zhong-ying Niu, Ya-ping Wan

Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract: Beyond high space efficiency, enterprise de-duplication backup demands high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck, and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve the capacity, throughput, and scalability of de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously small, random disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, thereby achieving high write throughput that is independent of workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
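
The abstract does not give Chunkfarm's internals, so the following is only a minimal sketch of how a generic partitioned hash join can be applied to fingerprint lookup. It is not the authors' implementation. The function names (`fingerprint`, `partition`, `hash_join_dedup`), the use of SHA-1, the partition count, and the in-memory lists standing in for on-disk fingerprint files are all assumptions made for illustration. The idea shown is that new and stored fingerprints are first split into matching partitions, and each partition pair is then joined with one in-memory hash table, so a disk-based version would read each partition once and sequentially instead of issuing one random lookup per fingerprint.

```python
import hashlib
from collections import defaultdict


def fingerprint(chunk: bytes) -> bytes:
    # SHA-1 is an assumption; the abstract does not name the hash function.
    return hashlib.sha1(chunk).digest()


def partition(fingerprints, num_partitions):
    """Group fingerprints into buckets keyed by a hash-derived partition id.

    In a disk-based system each bucket would be a file written sequentially;
    here plain lists stand in for those files.
    """
    buckets = defaultdict(list)
    for fp in fingerprints:
        pid = int.from_bytes(fp[:4], "big") % num_partitions
        buckets[pid].append(fp)
    return buckets


def hash_join_dedup(new_fps, index_fps, num_partitions=4):
    """Return (unique, duplicates) for a batch of new fingerprints.

    Both inputs are partitioned the same way, so a new fingerprint can only
    match stored fingerprints in the partition with the same id.
    """
    new_parts = partition(new_fps, num_partitions)
    index_parts = partition(index_fps, num_partitions)
    unique, duplicates = [], []
    for pid in range(num_partitions):
        # Build phase: load one index partition into an in-memory hash table.
        table = set(index_parts.get(pid, []))
        # Probe phase: check every new fingerprint of the same partition.
        for fp in new_parts.get(pid, []):
            if fp in table:
                duplicates.append(fp)
            else:
                unique.append(fp)
                # Record it so a repeat within the same batch is a duplicate.
                table.add(fp)
    return unique, duplicates


stored = [fingerprint(b"alpha"), fingerprint(b"beta")]
incoming = [fingerprint(c) for c in (b"beta", b"gamma", b"gamma", b"delta")]
unique, duplicates = hash_join_dedup(incoming, stored)
```

Because each partition pair is joined independently of the others, the partitions could in principle be assigned to different servers, which is one way to picture the decentralized lookup and update that the abstract describes.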

Key words: Backup system; De-duplication; Post-processing; Fingerprint lookup; Scalability

Received: 2009-07-22; Published: 2010-04-28

CLC number: TP309.3

Funding: Project supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028)

Corresponding author: Dan FENG, E-mail: dfeng@hust.edu.cn

Cite this article:

Tian-ming Yang, Dan Feng, Zhong-ying Niu, Ya-ping Wan. Scalable high performance de-duplication backup via hash join. Front. Inform. Technol. Electron. Eng., 2010, 11(5): 315-327.

Link to this article:

http://www.zjujournals.com/xueshu/fitee/CN/10.1631/jzus.C0910445 or http://www.zjujournals.com/xueshu/fitee/CN/Y2010/V11/I5/315
