Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal
Dipayan DEV, Ripon PATGIRI
Front. Inform. Technol. Electron. Eng.    2016, 17 (1): 15-31.   DOI: 10.1631/FITEE.1500015
Abstract

In this exabyte-scale era, data increases at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with big data. Due to this huge growth of metadata, however, the efficiency of Hadoop has been questioned numerous times by researchers. Therefore, it is essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are suitable for distributed metadata management schemes. Subtree partitioning does not uniformly distribute the workload among the metadata servers, and metadata needs to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, which are the metadata servers of Hadoop. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for excellent reliability, and dynamically distributes metadata among the NameNodes to maintain load balancing. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, and its failure constitutes a single point of failure (SPOF). DCMS removes Hadoop's SPOF and provides efficient and scalable metadata management. The new framework is named 'Dr. Hadoop' after the names of the authors.
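To make the placement idea concrete, here is a minimal sketch of a DCMS-style lookup under two stated assumptions: NameNodes sit on a 160-bit SHA-1 ring (consistent hashing, as in the paper's experiments), and the hash key is the parent directory of a path so that files in one directory co-locate (a simple stand-in for locality-preserving hashing). The class and method names are hypothetical, not the authors' code.

```python
import hashlib
from bisect import bisect_right

class DcmsRing:
    """Toy sketch of DCMS-style placement: NameNodes sit on a 160-bit
    hash ring (consistent hashing), while keys are derived from the
    parent directory of a path so that sibling files hash together
    (a simple form of locality-preserving hashing)."""

    def __init__(self, namenodes):
        # Place each NameNode on the ring by hashing its identifier.
        self.ring = sorted((self._sha1(n), n) for n in namenodes)

    @staticmethod
    def _sha1(key):
        # 160-bit identifier, as in the paper's experiments.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def locate(self, path):
        # Key on the parent directory: files in one directory map to
        # the same ring point, preserving namespace locality.
        parent = path.rsplit("/", 1)[0] or "/"
        point = self._sha1(parent)
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, point) % len(self.ring)
        return self.ring[idx][1]

ring = DcmsRing(["nn-A", "nn-B", "nn-C"])
print(ring.locate("/user/dev/logs/part-0001"))
print(ring.locate("/user/dev/logs/part-0002"))  # same NameNode as above
```

Keying on the parent directory rather than the full pathname is what trades a little balance for locality: FileHash-style schemes would hash each file independently and scatter a directory's metadata across the cluster.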


Table 2 Real data traces

Trace       Number of files   Data size (GB)   Metadata extracted (MB)
Yahoo       8 139 723         256.8            596.4
Microsoft   7 725 928         223.7            416.2
Extracts from the Article
This section provides the performance evaluation of DCMS in Dr. Hadoop using trace-driven simulation. The locality of the namespace is observed first, and then the scalability of DCMS is measured. For locality preservation, DCMS is compared with: (1) FileHash, in which files are randomly distributed based on their pathnames, each assigned to a metadata server (MDS); (2) DirHash, in which directories are randomly distributed just as in FileHash. Each NameNode identifier in the experiment is 160 bits in size, obtained from the SHA-1 hash function. We use the real traces shown in Table 2. 'Yahoo' denotes NFS and email traces from the Yahoo finance group, with a data size of 256.8 GB (including access pattern information). 'Microsoft' denotes traces of Microsoft Windows production build servers (Kavalanekar et al., 2008), from BuildServer00 to BuildServer07 over 24 h, with a data size of 223.7 GB (access pattern information included). A metadata crawler is applied to the datasets; it recursively extracts file/directory metadata using the stat() function.
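The crawler itself is straightforward to approximate. The following sketch recursively walks a tree and collects per-file metadata via stat(); the field selection is illustrative, not the authors' exact schema.

```python
import os

def crawl_metadata(root):
    """Recursively walk a file tree and collect per-file metadata via
    stat(), in the spirit of the paper's metadata crawler."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # skip files that vanish mid-crawl
            records.append({
                "path": full,
                "size": st.st_size,    # bytes
                "mtime": st.st_mtime,  # last modification time
                "mode": st.st_mode,    # permission/type bits
            })
    return records

meta = crawl_metadata("/tmp")
print(len(meta), "files crawled")
```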
The scalability of Dr. Hadoop is analyzed with the two real traces shown in Table 2. The growth of the metadata (namespace) is studied as increasing amounts of data are uploaded to the cluster in HDFS. These observations are tabulated in Tables 3, 4, and 5 for Hadoop and Dr. Hadoop, with 10 GB of data uploaded on each attempt. The metadata size in Table 5 is three times the original size for both Yahoo and Microsoft because of replication. Figs. 8a and 8b show the scalability of Hadoop and Dr. Hadoop in terms of load (MB) per NameNode for both datasets. The graphs show a linear increase in the metadata size of Hadoop and Dr. Hadoop. In traditional Hadoop, as the data size in the DataNodes increases, the metadata eventually grows to the upper bound of the main memory of the single NameNode. So the maximum data size that the DataNodes can hold is limited by the memory available on the single NameNode server. In Dr. Hadoop, DCMS provides a cluster of NameNodes, which reduces the metadata load per NameNode. This results in an enormous increase in the storage capacity of Dr. Hadoop.
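A rough load model clarifies the last point. Assuming DCMS spreads metadata evenly across the ring, the per-NameNode load is simply the replicated total divided by the cluster size. The helper below is a hypothetical back-of-envelope sketch using the Yahoo trace figure from Table 2 (596.4 MB of metadata) and the 3-way replication noted for Table 5.

```python
def load_per_namenode(metadata_mb, replication, namenodes):
    """Back-of-envelope model: replicated metadata split evenly
    across the NameNode cluster, assuming DCMS balances well."""
    return metadata_mb * replication / namenodes

# Illustrative: Yahoo trace (596.4 MB) with 3-way replication.
for n in (3, 5, 8):
    print(f"{n} NameNodes: {load_per_namenode(596.4, 3, n):.1f} MB each")
```

Adding NameNodes divides the per-node load, which is exactly why the data capacity of Dr. Hadoop is no longer capped by a single server's memory.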