Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal

Dipayan DEV, Ripon PATGIRI

Front. Inform. Technol. Electron. Eng. 2016, 17 (1): 15-31. DOI: 10.1631/FITEE.1500015

Abstract

In this exabyte-scale era, data increases at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework for dealing with big data. Because of this growth in metadata, however, many researchers have questioned the efficiency of Hadoop. It is therefore essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are the two approaches commonly used in distributed metadata management. Subtree partitioning does not distribute the workload uniformly among the metadata servers, so metadata must be migrated to keep the load roughly balanced. Hash-based mapping distributes the load uniformly among the NameNodes, which are the metadata servers of Hadoop, but it sacrifices metadata locality. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for reliability, and dynamically distributes metadata among the NameNodes to keep the load balanced. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, so its failure is a single point of failure (SPOF) for the whole system. DCMS removes Hadoop’s SPOF and provides efficient and scalable metadata management. The new framework is named ‘Dr. Hadoop’ after the names of the authors.

Fig. 3 System architecture. Physical NameNode servers compose a metadata cluster to form a DCMS overlay network

    /* File information mapping of: file or directory path - nodeURL */
    public static Hashtable<FilePath, MetaData> fileInfo =
        new Hashtable<FilePath, MetaData>();

    /* Cluster information mapping of: nodeURL and ClusterInfo object */
    public static Hashtable<String, ClusterInfo> clusterinfo =
        new Hashtable<String, ClusterInfo>();

    /* Data structure for storing all the NameNode servers */
    public Hashtable<NameNode_hostname, Namenode_URL> namenode =
        new Hashtable<NameNode_hostname, Namenode_URL>();

In DCMS, each NameNode has equal priority, so the cluster behaves in a purely decentralized manner. The typical system architecture is shown in Fig. 3. DCMS is a cluster of NameNodes organized in a circular fashion. Each NameNode's hostname is denoted NameNode_X, and its neighbors are NameNode_(X-1) and NameNode_(X+1). The hash function, SHA-1 in our case, is sufficiently random that when many keys are inserted, consistent hashing distributes them evenly across the NameNode servers. DCMS improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node in the cluster. Instead, each node needs routing information for only two nodes, its left and right neighbors in topological order, because each NameNode places a replica of its hashtable on these two neighbor servers. Replication management is discussed in the following section.
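The circular organization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name DcmsRing and all of its methods are hypothetical, and it assumes that a NameNode's ring position is the SHA-1 hash of its hostname and that a key belongs to the first NameNode clockwise from the key's hash, as in standard consistent hashing. It also assumes the ring holds at least one NameNode; the owner and its two neighbors are three distinct servers only when the ring has three or more nodes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a DCMS-style ring; names are illustrative only. */
class DcmsRing {
    // Ring position (SHA-1 of hostname) -> NameNode hostname, kept sorted.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    /** SHA-1 digest of a string, read as a non-negative 160-bit integer. */
    static BigInteger sha1(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return new BigInteger(1, md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    /** Places a NameNode on the ring at the hash of its hostname. */
    void addNameNode(String hostname) {
        ring.put(sha1(hostname), hostname);
    }

    /** Owner of a path: first NameNode at or after the path's hash, wrapping around. */
    String ownerOf(String filePath) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(sha1(filePath));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Left neighbor: previous NameNode on the ring, wrapping around. */
    String leftOf(String hostname) {
        Map.Entry<BigInteger, String> e = ring.lowerEntry(sha1(hostname));
        return (e != null ? e : ring.lastEntry()).getValue();
    }

    /** Right neighbor: next NameNode on the ring, wrapping around. */
    String rightOf(String hostname) {
        Map.Entry<BigInteger, String> e = ring.higherEntry(sha1(hostname));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Servers holding a path's metadata: the owner plus its two neighbors. */
    List<String> replicaHolders(String filePath) {
        String owner = ownerOf(filePath);
        List<String> holders = new ArrayList<>();
        holders.add(owner);
        holders.add(leftOf(owner));
        holders.add(rightOf(owner));
        return holders;
    }
}
```

In this sketch each node only ever looks up its immediate predecessor and successor, which mirrors the point that a NameNode needs routing information for just two other nodes. A sorted map is used here purely for convenience in a single process; a real deployment would store only the two neighbor addresses on each server.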