Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal
Dipayan DEV, Ripon PATGIRI
Front. Inform. Technol. Electron. Eng.    2016, 17 (1): 15-31.   DOI: 10.1631/FITEE.1500015

In this exabyte-scale era, data increases at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with big data. Owing to this huge growth of metadata, however, the efficiency of Hadoop has been questioned numerous times by many researchers. Therefore, it is essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are suitable for distributed metadata management schemes. Subtree partitioning does not uniformly distribute the workload among the metadata servers, and metadata needs to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, the metadata servers of Hadoop. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for excellent reliability, and dynamically distributes metadata among the NameNodes to maintain load balancing. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, and its failure constitutes a single point of failure (SPOF). DCMS removes Hadoop’s SPOF and provides efficient and scalable metadata management. The new framework is named ‘Dr. Hadoop’ after the name of the authors.
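The abstract's core idea, placing metadata on NameNodes via consistent hashing while preserving directory locality, can be sketched as follows. This is a minimal illustration, not the paper's actual DCMS implementation: the node names (nn1, nn2, nn3), the virtual-node count, the MD5 hash, and the rule of hashing a file's parent directory are all illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent-hash ring (illustrative sketch only)."""

    def __init__(self, nodes, vnodes=100):
        # Each NameNode gets `vnodes` points on the ring so that
        # load spreads roughly uniformly across nodes.
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

def metadata_server(path, ring):
    # Hash the parent directory, not the full path, so files under one
    # directory land on the same NameNode. This stands in for DCMS's
    # locality-preserving hashing; the real scheme differs in detail.
    parent = path.rsplit("/", 1)[0] or "/"
    return ring.lookup(parent)

ring = ConsistentHashRing(["nn1", "nn2", "nn3"])
# Files sharing a parent directory map to the same NameNode.
a = metadata_server("/user/alice/a.txt", ring)
b = metadata_server("/user/alice/b.txt", ring)
```

A practical benefit of the ring structure, as with consistent hashing generally, is that adding or removing a NameNode remaps only the keys adjacent to its ring points rather than rehashing all metadata.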


Parameter                                               | Traditional Hadoop | Dr. Hadoop
Maximum number of NameNode crashes that can be survived | 0                  | r − 1
Number of RPCs needed for a read operation              | 1                  | 1
Number of RPCs needed for a write operation             | 1                  | r
Metadata storage per NameNode                           | X                  | (X/r)m
Throughput of metadata read                             | X                  | Xm
Throughput of metadata write                            | X                  | X(m/r)


Table 1 Analytical comparison of traditional Hadoop and Dr. Hadoop
Extracts from the Article
Dr. Hadoop uses m NameNodes to handle the read operation. The throughput of metadata reads is thus m times that of traditional Hadoop. Similarly, the throughput of metadata writes is m/r times that of traditional Hadoop (m/r because the metadata is distributed over m NameNodes but with an overhead of replication factor r). The whole analysis and comparison are tabulated in Table 1.
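The throughput ratios above are simple arithmetic and can be checked numerically. In this sketch, m = 10 and r = 3 are example values, and X is a hypothetical baseline throughput of a single traditional NameNode; none of these numbers come from the paper's experiments.

```python
# Illustrative check of the Table 1 throughput ratios.
m = 10       # example: number of NameNodes in Dr. Hadoop
r = 3        # example: metadata replication factor
X = 1000.0   # hypothetical baseline ops/s of one traditional NameNode

# Reads are served by all m NameNodes in parallel: m times the baseline.
read_throughput = X * m

# Each write must be applied to r replicas, so the m-way parallelism
# is discounted by the replication factor: X * (m / r).
write_throughput = X * (m / r)

print(read_throughput, write_throughput)
```

With these example values the read throughput is 10 times the baseline, while the write throughput is only 10/3 times the baseline, which matches the m versus m/r entries in Table 1.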
Figs. 9a and 9b show the average read and write throughput, in terms of successfully completed operations, for Hadoop and Dr. Hadoop on both data traces. Dr. Hadoop's DCMS throughput is significantly higher than that of Hadoop, which validates our claim in Table 1. The experiment is conducted using 10 NameNodes; after a few seconds, Dr. Hadoop's speed shows some reduction, only because of the extra RPCs involved.