
Original article

Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal

DEV Dipayan, PATGIRI Ripon

Department of Computer Science and Engineering, NIT Silchar, India

Corresponding authors: †E-mail: dev.dipayan16@gmail.com; †E-mail: ripon@cse.nits.ac.in


Received: 2015-01-12   Accepted: 2015-06-11  

Abstract

In this exabyte-scale era, data increases at an exponential rate. This is in turn generating a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with big data. Due to this growth of a huge amount of metadata, however, the efficiency of Hadoop has been questioned numerous times by researchers. Therefore, it is essential to create an efficient and scalable metadata management system for Hadoop. Hash-based mapping and subtree partitioning are suitable in distributed metadata management schemes. Subtree partitioning does not uniformly distribute the workload among the metadata servers, and metadata needs to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, which are the metadata servers of Hadoop. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for excellent reliability, and dynamically distributes metadata among the NameNodes to keep the load balanced. The NameNode is the centralized heart of Hadoop, keeping the directory tree of all files; its failure causes a single point of failure (SPOF). DCMS removes Hadoop’s SPOF and provides efficient and scalable metadata management. The new framework is named ‘Dr. Hadoop’ after the names of the authors.

Keywords: Hadoop ; NameNode ; Metadata ; Locality-preserving hashing ; Consistent hashing


Cite this article

DEV Dipayan, PATGIRI Ripon. Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal. Frontiers of Information Technology & Electronic Engineering, 2016, 17(1): 15-31. doi: 10.1631/FITEE.1500015

1 Introduction

Big data, a recent buzz in the Internet world, is growing louder with every passing day. Facebook has almost 21 PB of data in 200 million objects (Beaver et al., 2010), whereas the Jaguar system at ORNL has more than 5 PB of data. The stored data is growing so rapidly that EB-scale storage systems are likely to be used by 2018-2020. By that time there should be more than one thousand 10 PB deployments. Hadoop has its own file system, the Hadoop Distributed File System (HDFS) (Shvachko et al., 2010; Dev and Patgiri, 2014), which is one of the most widely used large-scale distributed file systems, like GFS (Ghemawat et al., 2003) and Ceph (Weil et al., 2006), and which processes 'big data' efficiently.

HDFS separates file system data access and metadata transactions to achieve better performance and scalability. Application data is stored in various storage servers called DataNodes, whereas metadata is stored in one or more dedicated servers called NameNodes. The NameNode stores the global namespace and directory hierarchy of HDFS and other inode information. Clients access the data on DataNodes through the NameNode(s) via the network. In HDFS, all the metadata has only one copy, stored in the NameNode. When the metadata server designated as the NameNode in Hadoop fails, the whole Hadoop cluster goes offline. This is termed the 'single point of failure' (SPOF) of Hadoop (Wiki, 2012).

The NameNode mediates data transfers between a large number of clients and DataNodes, while itself not being responsible for storage or retrieval of data. Moreover, the NameNode keeps the entire namespace in its main memory, which is very likely to run out of memory as the metadata size increases. Considering all these factors, the centralized architecture of the NameNode in the Hadoop framework is a serious bottleneck and looks quite impractical. A practical system needs an efficient and scalable NameNode cluster to provide infinite scalability to the whole Hadoop framework.

The main problem in designing a NameNode cluster is how to partition the metadata efficiently among the cluster of NameNodes to provide high-performance metadata services (Brandt et al., 2003; Weil et al., 2004; Zhu et al., 2008). A typical metadata server cluster splits metadata among its members and tries to keep a proper load balance. To achieve this and to preserve better namespace locality, some servers become overloaded while others remain slightly underloaded.

The size of metadata ranges from 0.1 to 1 percent of the whole stored data size (Miller et al., 2008). This value seems negligible, but when measured against EB-scale data (Raicu et al., 2011), the metadata becomes huge to store in main memory, e.g., 1 to 10 PB for 1 EB of data. On the other hand, more than half of all file system accesses are for metadata (Ousterhout et al., 1985). Therefore, an efficient and high-performance file system implies efficient and systematic metadata management (Dev and Patgiri, 2015). Better NameNode cluster management should be designed and implemented to resolve all the serious bottlenecks of a file system. The metadata workload can be balanced by uniformly distributing the metadata to all the NameNodes in the cluster. Moreover, with ever-growing metadata size, infinite scalability should be achieved. The metadata of each NameNode server can be replicated to other NameNodes to provide better reliability and excellent failure tolerance.

We examine these issues and propose an efficient solution. As Hadoop is a 'write once, read many' architecture, metadata consistency of the write operation (atomic operations) is beyond the scope of this paper.

We provide a modified model of Hadoop, termed Dr. Hadoop. 'Dr.' is an acronym of the authors (Dipayan, Ripon). Dr. Hadoop uses a novel NameNode server cluster architecture named 'dynamic circular metadata splitting' (DCMS), which removes the SPOF of Hadoop. The cluster is highly scalable and it uses a key-value store mechanism (DeCandia et al., 2007; Escriva et al., 2007; Lim et al., 2007; Biplob et al., 2010) that provides a simple interface: lookup(key) for write and read operations. In DCMS, locality-preserving hashing (LpH) is implemented for excellent namespace locality. Consistent hashing is used in addition to LpH, which provides a uniform distribution of metadata across the NameNode cluster. In DCMS, with the increase in the number of NameNodes in the cluster, the throughput of metadata operations does not decrease, and hence any increase in the metadata scale does not affect the throughput of the file system. To enhance the reliability of the cluster, DCMS also replicates each NameNode's metadata to different NameNodes.

We evaluate DCMS and its competing metadata management policies in terms of namespace locality. We compare Dr. Hadoop's performance with that of traditional Hadoop from multiple perspectives like throughput, fault tolerance, and NameNode's load. We demonstrate that DCMS is more productive than traditional state-of-the-art metadata management approaches for the large-scale file system.

2 Background

This section discusses the different types of metadata server of distributed file systems (Haddad, 2000; White et al., 2001; Braam, 2007) and the major techniques used in distributed metadata management in large-scale systems.

2.1 Metadata server cluster scale

2.1.1 Single metadata server

A framework having a single metadata server (MDS) simplifies the overall architecture to a vast extent and enables the server to make data placement and replica management relatively easy. Such a structure, however, faces a serious bottleneck, which engenders the SPOF. Many distributed file systems (DFSs), such as Coda (Satyanarayanan et al., 1990), divide their namespace among multiple storage servers, thereby making all of the metadata operations decentralized. Other DFSs, e.g., GFS (Ghemawat et al., 2003), also use one MDS, with a failover server, which works on the failure of primary MDS. The application data and file system metadata are stored in different places in GFS. The metadata is stored in a dedicated server called master, while data servers called chunkservers are used for storage of application data. Using a single MDS at one time is a serious bottleneck in such an architecture, as the numbers of clients and/or files/directories are increasing day by day.

2.1.2 Multiple metadata servers

A file system's metadata can be dynamically distributed to several MDSs. The expansion of the MDS cluster provides high performance and avoids heavy loads on a particular server within the cluster. Many DFSs are now working to provide a distributed approach to their MDSs instead of a centralized namespace. Ceph (Weil et al., 2006) uses clusters of MDS and has a dynamic subtree partitioning algorithm (Weil et al., 2004) to evenly map the namespace tree to metadata servers. GFS is also moving into a distributed namespace approach (McKusick and Quinlan, 2009). The upcoming GFS will have more than hundreds of MDS with 100 million files per master server. Lustre (Braam, 2007) in its Lustre 2.2 release uses a clustered namespace. The purpose of this is to map a directory over multiple MDSs, where each MDS will contain a disjoint portion of the namespace.

2.2 Metadata organization techniques

A file system having cloud-scale data management (Cao et al., 2011) should have multiple MDSs to dynamically split metadata across them. Subtree partitioning and hash-based mapping are two major techniques used for MDS clusters in large-scale file systems.

2.2.1 Hash-based mapping

Hash-based mapping (Corbett and Feitelson, 1996; Miller and Katz, 1997; Rodeh and Teperman, 2003) applies a hash function to a pathname or filename to track a file's metadata. This helps clients locate and discover the right MDS. Clients' requests are distributed evenly among the MDSs to reduce the load on any single MDS. Vesta (Corbett and Feitelson, 1996), RAMA (Miller and Katz, 1997), and zFS (Rodeh and Teperman, 2003) use hashing of the pathname to retrieve the metadata. The hash-based technique provides better load balancing and removes hot-spots, e.g., popular directories.

2.2.2 Subtree partitioning

Static subtree partitioning (Nagle et al., 2004) explores a simple approach for distribution of metadata among MDS clusters. With this approach, the directory hierarchy is statically partitioned and each subtree is assigned to a particular MDS. It provides much better locality of reference and improved MDS independence compared to hash-based mapping. The major drawback of this approach is improper load balancing, which causes a system performance bottleneck. To cope with this, the subtree might be migrated in some cases, e.g., PanFS (Nagle et al., 2004).

3 Dynamic circular metadata splitting (DCMS) protocol

The dynamic circular metadata splitting of the Dr. Hadoop framework is a circular arrangement of NameNodes. The internal architecture of a NameNode (metadata server), which forms part of the circular arrangement of DCMS, is shown in Fig. 1. The hashtable storage manager keeps a hashtable that stores the key-value pairs of metadata. It also administers an SHA1 (U.S. Department of Commerce/NIST, 1995) based storage system through two components, i.e., a replication engine and failover management. The replication engine provides the redundancy mechanisms and determines the throughput of metadata reads and writes in DCMS.

Fig. 1

Fig. 1   The skeleton of a NameNode server in DCMS


DCMS implements metadata balancing across the servers using consistent hashing and excellent namespace locality using LpH that provides the most efficient metadata cluster management. A better LpH based cluster exploits the minimum number of remote procedure calls (RPCs) for read and write lookups.

All these features and approaches are described in the subsequent portions of this paper.

3.1 DCMS features

DCMS simplifies the design of Hadoop and the NameNode cluster architecture and removes various bottlenecks by addressing the following difficult problems:

1. Load balance: DCMS uses a consistent hashing approach, spreading keys uniformly over all the NameNodes; this contributes a high degree of load balance.

2. Decentralization: DCMS is fully distributed: all the nodes in the cluster have equal importance. This improves robustness and makes DCMS appropriate for loosely organized metadata lookup.

3. Scalability: DCMS provides infinite scalability to the Dr. Hadoop framework. With the increase in the size of metadata, only additional nodes (NameNodes) need to be added to the cluster. No extra parameter modification is required to achieve this scaling.

4. Two-for-one: The left and right NameNodes of a NameNode keep the metadata of the middle one and behave as the 'hot-standby' (Apache Software Foundation, 2012) to it.

5. Availability: DCMS uses a Resource-Manager to keep track of all the NameNodes that are newly added, as well as those that have failed. During a failure, the left and right neighbors of the failed NameNode in DCMS can serve lookups on its behalf. The failed node is replaced by electing a DataNode as a NameNode using a suitable election algorithm. That is why there is no downtime in DCMS. If recovered later, the failed NameNode will serve as a DataNode.

6. Uniform key size: DCMS puts no restriction on the structure of the keys in the hashtable; the DCMS key-space is flat. It uses SHA1 on each file or directory path to generate a unique identifier of k bytes. So, even a huge file/directory pathname does not take more than k bytes of key.

7. Self-healing, self-managing, and self-balancing: In the DCMS, the administrator need not worry about the increase in the number of NameNodes or replacement of NameNodes. The DCMS has the feature of self-healing, self-balancing, and self-managing of metadata and the MetaData server.

3.2 Overview

DCMS is an in-memory hashtable-based system, in which a key-value pair is shown in Fig. 2. It is designed to meet the following five general goals: (1) high scalability of the NameNode cluster, (2) high availability with a suitable replication management, (3) excellent namespace locality, (4) fault tolerance capacity, and (5) dynamic load balancing.

Fig. 2

Fig. 2   Key-value system. Key is SHA1 (pathname), while the value is its corresponding metadata


In DCMS design, each NameNode stores one unique hashtable. Like other hash-based mappings, it uses hashing to distribute the metadata across multiple NameNodes in the cluster. It maintains a directory hierarchical structure such that the metadata of all the files in a common directory is stored in the same NameNode. For distribution of metadata, it uses locality-preserving hashing (LpH), which is based on the pathname of each file or directory. The use of LpH in DCMS eliminates the overhead of directory hierarchy traversal. To access data, a client hashes the pathname of the file with the same locality-preserving hash function to locate the metadata server containing the metadata of the file, and then contacts that metadata server. The process is extremely efficient, typically involving a single message to a single NameNode. DCMS applies the SHA1() hash function to the full path of each file or directory. SHA1 stands for 'secure hash algorithm'; it generates a unique 20-byte identifier for every different input.

In HDFS, all files are divided into multiple chunks of a pre-defined size, called blocks. SHA1() is applied to each block to generate a unique 20-byte identifier, which is stored in the hashtable as the key. For example, the identifiers of the first and third blocks of the file '/home/dip/folder/fold/fold1/file.txt' are SHA1(home/dip/folder/fold/fold1/file.txt, 0) and SHA1(home/dip/folder/fold/fold1/file.txt, 2), respectively.
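For illustration, this block-key generation can be sketched with the standard Java MessageDigest API. The way the block index is appended to the path before hashing is an assumption; the text only specifies the form SHA1(path, blockIndex).

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlockKey {
    /* Compute the 20-byte SHA1 identifier for one block of a file.
       The (path, blockIndex) encoding below is illustrative. */
    public static byte[] sha1Key(String path, int blockIndex) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(path.getBytes(StandardCharsets.UTF_8));
        sha1.update(Integer.toString(blockIndex).getBytes(StandardCharsets.UTF_8));
        return sha1.digest();   // always 20 bytes, regardless of path length
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] k0 = sha1Key("/home/dip/folder/fold/fold1/file.txt", 0); // first block
        byte[] k2 = sha1Key("/home/dip/folder/fold/fold1/file.txt", 2); // third block
        System.out.println(k0.length + " " + k2.length);               // both print 20
    }
}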

3.3 System architecture

In DCMS, each NameNode possesses equal priority and hence the cluster shows purely decentralized behavior. The typical system architecture is shown in Fig. 3. DCMS is a cluster of NameNodes organized in a circular fashion. Each NameNode's hostname is denoted by NameNode_X, and it has two neighbors, NameNode_(X-1) and NameNode_(X+1). The hash function (SHA1 in our case) is sufficiently random, so when many keys are inserted, the nature of consistent hashing ensures that they are evenly distributed across the NameNode servers. DCMS improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node in the cluster. However, in DCMS, each node needs routing information for two nodes, its left and right neighbors in topological order. This is because each NameNode places a replica of its hashtable on its two neighbor servers. The replication management is discussed in the following section.

Fig. 3

Fig. 3   System architecture. Physical NameNode servers compose a metadata cluster to form a DCMS overlay network

/* File information: mapping of the key of a file or directory path to its metadata */

public static Hashtable<FilePath, MetaData> fileInfo = new Hashtable<FilePath, MetaData>();

/* Cluster information: mapping of nodeURL to a ClusterInfo object */

public static Hashtable<String, ClusterInfo> clusterinfo = new Hashtable<String, ClusterInfo>();

/* Data structure for storing all the NameNode servers: hostname to NameNode URL */

public Hashtable<String, String> namenode = new Hashtable<String, String>();


DCMS holds three primary data structures: namenode, fileinfo, and clusterinfo. These data structures are the building blocks of the DCMS metadata. The namenode and clusterinfo structures are stored on the Resource-Manager (JobTracker) of the typical Hadoop architecture, and fileinfo is stored on each NameNode, handling the mapping of a file to its locations in the DataNodes.

1. namenode: The namenode data structure stores the mapping of the NameNode's hostname to NameNode-URL. It is used to obtain the list of all the servers to keep track of all the NameNodes in the DCMS cluster. The Resource-Manager holds this data structure to track all the live NameNodes in the cluster by a heartbeat mechanism. It must update this mapping table information when a new NameNode joins or leaves the DCMS cluster. The Dr. Hadoop framework can be altered by reforming this data structure.

2. fileinfo: The fileinfo data structure, stored in each NameNode of the DCMS cluster, keeps the mapping of the SHA1 () of the file or directory path to the corresponding location where they are stored in the DataNodes. The mapping is basically the key-value stored in the hashtable. The mapping locates the DataNodes on the second layer. The mapped fileinfo metadata is used to access DataNodes on the cluster where the files are actually stored.

3. clusterinfo: The clusterinfo data structure, stored in the Resource-Manager, stores the basic information about the cluster such as storage size, available space, and the number of DataNodes.

3.4 Replication management

In Dr. Hadoop we replicate each NameNode's hashtable to different NameNodes. The idea of replication management in DCMS is that each replica starts from a particular state, receives some input (a key-value pair), moves to the same resulting state, and outputs the result. In DCMS, all the replicas follow an identical metadata management principle. The input here means only metadata write requests, because only file/directory write requests can change the state of a NameNode server. A read request needs only a lookup.

We designate the main hashtable as primary, and the secondary hashtable replicas are placed in the left and right NameNodes of the primary. The primary is in charge of replicating all metadata updates and preserving consistency with the other two. As shown in Fig. 4, NameNode_(X+1) stores the primary hashtable HT_(X+1), and its secondary replicas are stored in NameNode_X and NameNode_(X+2), its left and right NameNodes, respectively. These two nodes, hosting the in-memory replicas, behave as hot standbys for NameNode_(X+1).

Fig. 4

Fig. 4   Replication management of Dr. Hadoop in DCMS


The metadata requests are not concurrent in Dr. Hadoop, in order to preserve metadata consistency. When a primary NameNode processes a metadata write request, it first puts the replicas on its secondaries before sending the acknowledgement back to the client. Details about these operations are explained later.

The secondary replica hashtables do not need to process any metadata write request. Clients send metadata write requests only to the primary hashtable, i.e., HT_(X+1) in the figure. However, read requests can be processed by any replica, either primary or secondary, i.e., HT_X or HT_(X+2) in the figure, to increase the read throughput.
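A minimal sketch of this write/read path is given below. The NameNodeStub interface and its put()/get() methods are hypothetical stand-ins for Dr. Hadoop's actual RPC interface, which is not spelled out here.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/* Hypothetical RPC stub for one NameNode's in-memory hashtable. */
interface NameNodeStub {
    void put(byte[] key, byte[] metadata);   // write into the hashtable
    byte[] get(byte[] key);                  // lookup from the hashtable
}

class ReplicatedWrite {
    /* The primary stores the entry, then pushes it to its left and right neighbors
       before the acknowledgement is returned to the client. */
    static void write(NameNodeStub primary, NameNodeStub left, NameNodeStub right,
                      byte[] key, byte[] metadata) {
        primary.put(key, metadata);
        left.put(key, metadata);    // secondary replica (hot standby)
        right.put(key, metadata);   // secondary replica (hot standby)
        // only now is the client acknowledged
    }

    /* Reads may be served by the primary or either secondary to raise read throughput. */
    static byte[] read(List<NameNodeStub> replicas, byte[] key) {
        int pick = ThreadLocalRandom.current().nextInt(replicas.size());
        return replicas.get(pick).get(key);
    }
}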

4 Hashing approaches of DCMS

4.1 Consistent hashing

The consistent hash function assigns each NameNode and key (hashtable) an n-bit identifier using a base hash function such as SHA1 (U.S. Department of Commerce/NIST, 1995). A NameNode's identifier is chosen by hashing its IP address, while a key identifier for the hashtable is produced by hashing the key.

As explained in an earlier section, DCMS consists of a circular arrangement of NameNodes. Assume n bits are needed to specify an identifier. We applied consistent hashing in DCMS as follows: identifiers are ordered by their hash code modulo $2^n$. Key k is assigned to NameNode_0 if this result is 0, and so on. This node is called the home node of key k, denoted by home(k). If identifiers are denoted by 0 to $2^n-1$, then home(k) for result 0 is the first NameNode of the circle, having hostname NameNode_0.
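The assignment of a key to its home node can be sketched as follows. Reducing the 160-bit SHA1 digest modulo the number of NameNodes is an illustrative choice consistent with the circular layout described above; the class and method names are not part of Dr. Hadoop's published interface.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HomeNode {
    /* home(k): map a key to the index of the NameNode that owns it.
       The 160-bit SHA1 digest is treated as an unsigned integer and
       reduced modulo the number of NameNodes in the circle. */
    public static int home(String key, int numNameNodes) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(key.getBytes());
        BigInteger id = new BigInteger(1, digest);   // unsigned 160-bit identifier
        return id.mod(BigInteger.valueOf(numNameNodes)).intValue();
    }
}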

The following results are proven in the papers that introduced consistent hashing (Karger et al., 1997; Lewin, 1998) and that are the foundation of our architecture.

Theorem 1 For any set of N nodes and K keys, with high probability, we have

1. Each node is responsible for at most (1 + ε)K/N keys.

2. When an (N + 1)th node joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node).

With SHA1, this ε can be reduced to a negligible value providing a uniform K/N key distribution to each NameNode in the DCMS.

We need to discuss the term 'with high probability' mentioned in the theorem. As the NameNodes and keys in our model are randomly chosen, the probability distribution over the choices of keys is very likely to produce a balanced distribution. However, if an adversary chose all the keys so that they hash to an identical identifier (NameNode), the situation would produce a highly unbalanced distribution, destroying the load balancing property. The consistent hashing paper uses 'k-universal hash functions', which guarantee good load balancing even when someone uses non-random keys.

Instead of using a 'universal hash function', DCMS uses the standard SHA1 function as its base hash function. This makes the hashing deterministic, so the phrase 'high probability' no longer applies. However, it should be noted that collisions in SHA1 can occur, but with negligible probability.

4.1.1 Collision probability of SHA1

Considering random hash values with a uniform distribution, a hash function that produces n bits and a collection of b different blocks, the probability p that one or more collisions will occur for different blocks is given by

$p \le \frac{{b(b - 1)}}{2}\frac{1}{{{2^n}}}$

For a huge storage system that contains 1 PB ($10^{15}$ bytes) of data, or an even larger system that contains 1 EB ($10^{18}$ bytes) stored as 8 KB blocks ($10^{14}$ blocks), using the SHA1 hash function, the probability of a collision is less than $10^{-20}$. Such a scenario seems sufficiently unlikely that we can ignore it and use the SHA1 hash as a unique identifier for a block.
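The quoted bound can be checked with a few lines of arithmetic, assuming b = 10^14 blocks and the 160-bit SHA1 digest:

public class CollisionBound {
    public static void main(String[] args) {
        double b = 1e14;                                   // number of 8 KB blocks in a 1 EB store
        double n = 160;                                    // SHA1 digest length in bits
        double p = b * (b - 1) / 2 / Math.pow(2, n);       // upper bound from the formula above
        System.out.println(p);                             // about 3.4e-21, i.e., below 10^-20
    }
}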

4.2 Locality-preserving hashing

In EB-scale storage systems, we should implement near-optimal namespace locality by assigning keys that are close in terms of their full pathnames to the same NameNode.

To attain near-optimal locality, the entire directory tree nested under a point has to reside on the same NameNode. The clients distribute the metadata for the files and directories over DCMS by computing a hash of the parent path present in the file operation. Using the parent path implies that the metadata for all the files in a given directory is present at the same NameNode. Algorithms 1 and 2 describe the creation and reading operations of a file by a client over DCMS.

Algorithm 1

Algorithm 1   Write operation of Dr. Hadoop in DCMS


Algorithm 2

Algorithm 2   Read operation of Dr. Hadoop in DCMS


Algorithms 1 and 2 show the write and read operations respectively in Dr. Hadoop on DCMS to preserve the locality-preserving hashing.

In Algorithm 1, the client sends the query as a file path or directory path to Dr. Hadoop, and it outputs the data in the DataNodes as well as the metadata in the NameNodes of DCMS with proper replication and consistency.

As mentioned in Section 3.4, the hashtable of NameNode_X is denoted Hashtable_X. When a client issues a write operation on '/dir1/dir2/dir3/file.txt', it hashes the parent path, i.e., the hash of '/dir1/dir2/dir3', so that all the files that belong to that particular directory reside in the same NameNode. If the hash of the above query outputs 3, then the metadata of that file is stored in Hashtable_3. For synchronous one-copy replication purposes, that metadata is replicated to Hashtable_2 and Hashtable_4. After the whole process is completed, Dr. Hadoop sends an acknowledgement to the client.

The read operation of Dr. Hadoop, portrayed in Algorithm 2 to preserve LpH, follows a similar trend to Algorithm 1. When the client issues a read operation on '/dir1/dir2/dir3/file.txt', it again hashes the parent directory path. If during the write this value was 3, it will now output the same value. So, as per the algorithm, the client will automatically contact NameNode_3 again, where the metadata actually resides. Now, as the metadata is stored in three consecutive hashtables, DCMS provides a parallel reading capability, i.e., from Hashtable_2, Hashtable_3, and Hashtable_4. The application performs getBlock() in parallel each time to obtain the metadata of each block of the file. This hugely increases the throughput of the read operation of Dr. Hadoop.
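A sketch of the parent-path hashing that both algorithms rely on is given below; the helper names parentPath() and selectNameNode() are illustrative and not part of the published algorithms.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LocalityHash {
    /* All files under the same directory hash to the same NameNode,
       because only the parent path takes part in the hash. */
    static String parentPath(String fullPath) {
        int slash = fullPath.lastIndexOf('/');
        return slash > 0 ? fullPath.substring(0, slash) : "/";
    }

    static int selectNameNode(String fullPath, int numNameNodes) throws NoSuchAlgorithmException {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(parentPath(fullPath).getBytes());
        return new BigInteger(1, d).mod(BigInteger.valueOf(numNameNodes)).intValue();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Both files share the parent '/dir1/dir2/dir3', so both map to the same NameNode.
        System.out.println(selectNameNode("/dir1/dir2/dir3/file.txt", 10));
        System.out.println(selectNameNode("/dir1/dir2/dir3/other.txt", 10));
    }
}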

Fig. 5 shows the step-by-step operation to access files by clients in Dr. Hadoop.

Fig. 5

Fig. 5   Client accessing files on the Dr. Hadoop framework


Step 1: The client wishes to create a file named '/dir1/dir2/filename'. It computes a hash of the parent path, '/dir1/dir2', to determine which NameNode it has to contact.

Step 2: Before returning the response to the client, the NameNode sends a replication request to its left and right topological NameNodes to perform the same operation.

Step 3: The NameNode looks up or adds a record in its hashtable, and the file is stored in or retrieved from the DataNodes.

Step 4: The NameNode sends the response back to the client.

5 Scalability and failure management

5.1 Node join

In a DCMS-like dynamic network, new NameNodes join the cluster whenever the metadata size exceeds the combined main memory capacity (considering the replication factor) of all the NameNodes. The main challenge while doing so is to preserve the locality of every key k in the cluster. To achieve this, DCMS needs to ensure two things:

1. Each node's successor and predecessor are properly maintained.

2. home(k) is responsible for every key k of DCMS.

In this section we discuss how to conserve these two factors while adding a new NameNode to the DCMS cluster.

For the join operation, each NameNode maintains a successor and a predecessor pointer. A NameNode's successor and predecessor pointers contain the DCMS identifier and IP address of the corresponding node. To preserve the above factors, DCMS must perform the following three operations when a node n, i.e., NameNode_N, joins the network (a sketch of these steps follows the list):

1. Initialize the successor and predecessor of NameNode_N.

2. Update these two pointers of NameNode_(N-1) and NameNode_0.

3. Notify the Resource-Manager to update its namenode data structure about the addition of NameNode_N.
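A sketch of these three steps is given below; the Node class and method names are illustrative, and the RPC plumbing between the Resource-Manager and the NameNodes is omitted.

/* Hypothetical representation of one NameNode in the DCMS ring. */
class Node {
    String hostname;          // e.g., "NameNode_3"
    String url;               // node URL registered with the Resource-Manager
    Node successor;
    Node predecessor;
}

class JoinManager {
    /* Insert newNode between the current tail (NameNode_(N-1)) and the head (NameNode_0). */
    static void addNewNameNode(Node tail, Node head, Node newNode,
                               java.util.Hashtable<String, String> namenode) {
        // Step 1: initialize the pointers of the joining node
        newNode.predecessor = tail;
        newNode.successor = head;
        // Step 2: update the pointers of its new neighbors
        tail.successor = newNode;
        head.predecessor = newNode;
        // Step 3: notify the Resource-Manager so the namenode table stays current
        namenode.put(newNode.hostname, newNode.url);
    }
}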

5.2 Transferring replicas

The last operation to be executed when a NameNode_N joins DCMS is to shift the responsibility of holding the replicas (hashtables) of its new successor and predecessor. Typically, this involves moving the data associated with the keys to keep the hashtables in DCMS balanced for Dr. Hadoop. Algorithm 3 and Fig. 6 provide a detailed overview of this operation.

Fig. 6

Fig. 6   Dynamic nature of Dr. Hadoop: (a) initial setup of DCMS with half filled RAM; (b) addition of NameNode in DCMS when RAM gets filled up


Algorithm 3

Algorithm 3   Node join operation of Dr. Hadoop: addnew_NameNode_N()


5.3 Failure and replication

The number of NameNodes in the DCMS of Dr. Hadoop would not be very large, so we use heartbeats (Aguilera et al., 1997) for failure detection. Each NameNode sends a heartbeat to the Resource-Manager every 3 s. The Resource-Manager detects a NameNode failure by checking every 200 s whether any node has not sent a heartbeat for at least 600 s. If a NameNode_N is declared failed, other NameNodes whose successor and predecessor pointers contain NameNode_N must adjust their pointers accordingly. In addition, the failure of the node should not be allowed to distort any metadata operation in progress while the cluster re-stabilizes.

After a node fails, but before stabilization takes place, a client might send a lookup(key) request to this NameNode. Generally, such lookups are processed after a timeout, but by another NameNode (its left or right neighbor) acting as a hot standby, despite the failure. This is possible because the hashtable has replicas stored in the left and right nodes.
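The failure detector described above can be sketched as follows; the class and method names are illustrative, and the intervals follow the values given in the text (3 s heartbeats, a 200 s sweep, and a 600 s staleness threshold).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/* Sketch of the heartbeat-based failure detector run by the Resource-Manager. */
class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();  // hostname -> last time (ms)

    void onHeartbeat(String hostname) {               // called roughly every 3 s per NameNode
        lastHeartbeat.put(hostname, System.currentTimeMillis());
    }

    /* Sweep executed every 200 s: any NameNode silent for at least 600 s is declared failed. */
    void sweep() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() >= 600_000L) {
                declareFailed(e.getKey());
            }
        }
    }

    void declareFailed(String hostname) {
        // neighbors adjust successor/predecessor pointers; a DataNode is elected as replacement
        System.out.println(hostname + " declared failed");
    }
}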

6 Analysis

6.1 Design analysis

Let the total number of nodes in the cluster be n. In traditional Hadoop, there are only 1 primary NameNode, 1 Resource-Manager, and (n-2) DataNodes to store the actual data. The single NameNode stores all the metadata of the framework. Hence, it is a centralized component, and the framework cannot tolerate the crash of that single server.

In Dr. Hadoop, we use m NameNodes and replicate the metadata r times across them. So, Dr. Hadoop can handle the crash of (r-1) NameNode servers. In traditional Hadoop, 1 remote procedure call (RPC) is enough for the lookup of the file/directory path needed to create a file. Using Algorithm 1, however, Dr. Hadoop needs 3 RPCs to create a file (one per replica, with r = 3) and 1 RPC to read a file.

Dr. Hadoop uses m NameNodes to handle the read operation. The throughput of metadata reads is thus m times that of traditional Hadoop. Similarly, the throughput of the metadata write operation is m/r times that of traditional Hadoop (m/r because the metadata is distributed over m NameNodes but with an overhead of replication factor r). The whole analysis and comparison are tabulated in Table 1.

Table 1   Analytical comparison of traditional Hadoop and Dr. Hadoop

Parameter | Traditional Hadoop | Dr. Hadoop
Maximum number of NameNode crashes that can be survived | 0 | r - 1
Number of RPCs needed for read operation | 1 | 1
Number of RPCs needed for write operation | 1 | r
Metadata storage per NameNode | X | (X/m)r
Throughput of metadata read | X | Xm
Throughput of metadata write | X | X(m/r)



6.2 Failure analysis

In the failure analysis of Dr. Hadoop, the failures of NameNodes are assumed to be independent of one another and equally distributed over time. We ignore the failures of DataNodes in this study because they have the same effect on Hadoop and Dr. Hadoop, which simplifies our failure analysis.

6.2.1 Traditional Hadoop

Mean time between failures (MTBF) is defined as the mean time between successive failures of a system. So, the probability that a NameNode server fails in a given unit of time is denoted by

$f = \frac{1}{{{\rm{MTBF}}}}.$

Let ${R}_\mathrm{Hadoop}$ be the time to recover from the failure. Therefore, the traditional Hadoop framework will be unavailable for ${fR_\mathrm{Hadoop}}$ of the time.

If NameNode fails once a month and takes 6 h to recover from the failure,

$\begin{array}{*{20}{l}}{f = 1/(30 \times 24) = 1.38 \times {{10}^{ - 3}},{R_{{\rm{Hadoop}}}} = 6,}\\{\;\;\;\;\;\;\;\;\;\;\;{\rm{Unavailability}} = f \cdot {R_{{\rm{Hadoop}}}}.}\end{array}$

Thus, $\mathrm{Unavailability} = 1.38\times10^{-3}\times6=8.28\times10^{-3}$. So, Availability = (100 - 0.828)% = 99.172%.

6.2.2 Dr. Hadoop

In Dr. Hadoop, say there are m metadata servers used for DCMS and let r be the replication factor used. Let f denote the probability that a given server fails at a given time t and ${R_\mathrm{Dr. Hadoop}}$ the time to recover from the failure.

Note that approximately we have

${R_{{\rm{Dr}}.{\rm{Hadoop}}}} = \frac{{r{R_{{\rm{Hadoop}}}}}}{m}.$

This is because the recovery time of metadata is directly proportional to the amount of metadata stored in its system. As per our assumption, the metadata is replicated r times.

Now, the probability of failure of any r consecutive NameNodes in DCMS can be calculated using the binomial distribution of the probability. Let P be the probability of this happening. We have

$P = m{f^r}{(1 - f)^{m - r}}.$

Let there be 10 NameNodes and let each metadata be replicated three times, i.e., r=3, and let the value of f be 0.1 in three days. Putting these values in the equation results in a P of 0.47%.

In Dr. Hadoop a portion of the system becomes offline if and only if the r consecutive NameNodes of DCMS fail within the recovery time of one another.

This situation occurs with a probability

${F_{{\rm{Dr}}.{\rm{Hadoop}}}} = mf{\left({\frac{{f{R_{{\rm{Dr}}.{\rm{Hadoop}}}}}}{t}} \right)^r}{(1 - f)^{m - r}}.$

To compare the failure analysis with that of traditional Hadoop, we take the failure probability f to be much less than in the earlier calculation. Let f be 0.1 in three days, m be 10, and, for DCMS, r = 3.

$R_\mathrm{Dr. Hadoop}$ is calculated as

${R_{{\rm{Dr}}.{\rm{Hadoop}}}} = \frac{{r{R_{{\rm{Hadoop}}}}}}{m} = 3 \times 6/10 = 1.8.$

So, the recovery time of Dr. Hadoop is 1.8 h, while in the case of Hadoop it is 6 h. Now,

${F_{{\rm{Dr}}{\rm{.Hadoop}}}} = 10 \times 0.1 \times {\left({\frac{{0.1 \times 1.8}}{{3 \times 24}}} \right)^3}{(1 - 0.1)^{10 - 3}}.$

The above gives

${F_{{\rm{Dr}}{\rm{.Hadoop}}}} = 7.46 \times {10^{ - 9}}.$

The file system of Dr. Hadoop is unavailable for $F_\mathrm{Dr.Hadoop}R_\mathrm{Dr.Hadoop}$ of the time.

$F_\mathrm{Dr.Hadoop}R_\mathrm{Dr.Hadoop}=7.46 \times 10^{-9} \times 1.8 = 1.34\times 10^{-8}$. Thus, Availability = 99.99%.
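The arithmetic of Sections 6.2.1 and 6.2.2 can be reproduced directly with the same assumed numbers (one failure per month with 6 h recovery for Hadoop; m = 10, r = 3, and f = 0.1 per three days for Dr. Hadoop):

public class AvailabilityCheck {
    public static void main(String[] args) {
        // Traditional Hadoop
        double fHadoop = 1.0 / (30 * 24);                 // failures per hour
        double rHadoop = 6.0;                             // recovery time (h)
        double unavailHadoop = fHadoop * rHadoop;         // ~8.3e-3 -> availability ~99.17%

        // Dr. Hadoop
        int m = 10, r = 3;
        double f = 0.1;                                   // failure probability within t = 3 days
        double t = 3 * 24;                                // window length (h)
        double rDr = r * rHadoop / m;                     // 1.8 h recovery
        double fDr = m * f * Math.pow(f * rDr / t, r) * Math.pow(1 - f, m - r); // ~7.46e-9
        double unavailDr = fDr * rDr;                     // ~1.34e-8 -> availability ~99.99%

        System.out.printf("Hadoop unavailability:     %.3e%n", unavailHadoop);
        System.out.printf("Dr. Hadoop unavailability: %.3e%n", unavailDr);
    }
}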

So, this shows the increase in the availability of Dr. Hadoop (99.99%) over traditional Hadoop (99.172%). The improvement in recovery time is also shown above.

7 Performance evaluation

This section provides the performance evaluation of the DCMS of Dr. Hadoop using trace-driven simulation. Locality of namespace is first carefully observed, and then we perform the scalability measurement of DCMS. The performance of DCMS for locality preservation is compared with: (1) FileHash, in which files are randomly distributed based on their pathnames, each of them assigned to an MDS; (2) DirHash, in which directories are randomly distributed just as in FileHash. Each NameNode identifier in the experiment is 160 bits in size, obtained from the SHA1 hash function. We use the real traces shown in Table 2. Yahoo denotes traces of NFS and email by the Yahoo finance group, with a data size of 256.8 GB (including access pattern information). Microsoft denotes traces of Microsoft Windows production build servers (Kavalanekar et al., 2008), from BuildServer00 to BuildServer07 over 24 h, with a data size of 223.7 GB (access pattern information included). A metadata crawler is applied to the datasets; it recursively extracts file/directory metadata using the stat() function.

Table 2   Real data traces

Trace | Number of files | Data size (GB) | Metadata extracted (MB)
Yahoo | 8 139 723 | 256.8 | 596.4
Microsoft | 7 725 928 | 223.7 | 416.2



The cluster used for all the experiments has 40 DataNodes, 1 Resource-Manager, and 10 NameNodes. Each node runs at a 3.10 GHz clock speed with 4 GB of RAM and a gigabit Ethernet NIC. All nodes are configured with a 500 GB hard disk. Ubuntu 14.04 is used as the operating system. Hadoop version 2.5.0 is configured for the comparison between Hadoop and Dr. Hadoop, keeping the HDFS block size at 512 MB.

For our experiments, we have developed an environment for different events of our simulator. The simulator is used for validation of different design choices and decisions.

To write the code for the simulator, the system (Dr. Hadoop) was studied in depth, and different entities and parameters such as replication_factor, start_time, end_time, and split_path were identified. After this, the behaviors of the entities were specified and classes for each entity were defined.

The different classes for simulation are placed on an in-built event-driven Java simulator, SimJava (Fred and McNab, 1998). SimJava is a popular Java toolkit which uses multiple threads for the simulation. As our model is based on an object-oriented system, the multiple threaded simulator was preferred.

In terms of resource consumption, SimJava gives the best efficiency for long-running simulations. For experiments with a massive amount of data, it might run out of system memory. This does not occur in our case, because we deal only with metadata, and the simulator delivers its best results when verified against a real-time calculation.

The accuracy of the designed simulator is measured based on the real-time calculation, using benchmarking based on the nature of the experiment. For failure analysis, the real-time calculation is done and the result is compared with that found from the simulator. For other experiments, benchmarks are executed and a mismatch ranging from milliseconds to tens of seconds is observed. The overall results show a mismatch of around 1.39% on average between the designed simulator and a real-time device.

To better understand the accuracy of the simulator, in Fig. 8a, a comparison between simulator results and real-time results is shown.

Fig. 7

Fig. 7   Locality comparisons of paths at three levels over two traces in the cluster with 10 NameNodes: (a) Yahoo trace; (b) Microsoft Windows trace


Fig. 8

Fig. 8   Linear growth of metadata for Hadoop and Dr. Hadoop load per NameNode: (a) Microsoft Windows trace; (b) Yahoo trace


7.1 Experiment on locality-preserving

DCMS uses LpH, which is the finest feature of Dr. Hadoop. Excellent namespace locality of Dr. Hadoop's MDS cluster is necessary to reduce the I/O requests of a metadata lookup. In this experiment, we use the parameter locality $=\sum\nolimits_{j = 1}^m {{p_{ij}}} $, where $p_{ij}$ (either 0 or 1) represents whether a subtree (directory) path $p_{i}$ is stored in NameNode j or not. Basically, this metric gives the total number of NameNodes across which a subtree path is split. Fig. 7 portrays the average namespace locality comparison of paths at three different levels on the two given traces using three metadata distribution techniques.

Figs. 7a and 7b show that the performance of DCMS is significantly better than that of FileHash and DirHash for the two given traces. This is because DCMS achieves near-optimal namespace locality using LpH; i.e., keys are assigned based on the order of the pathnames. In contrast, in FileHash and DirHash, the order of pathnames is not maintained, so namespace locality is not preserved at all.
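On this reading of the metric, the split count of a subtree can be computed as sketched below; the assignment of files to NameNodes is assumed to come from the metadata crawler output, and a value of 1 corresponds to a subtree that is not split at all.

import java.util.*;

class LocalityMetric {
    /* nameNodesOfSubtreeFiles: for each file under the subtree, the index of the
       NameNode storing its metadata. The metric is the number of distinct NameNodes. */
    static int locality(Collection<Integer> nameNodesOfSubtreeFiles) {
        return new HashSet<>(nameNodesOfSubtreeFiles).size();
    }

    public static void main(String[] args) {
        // DCMS (LpH): all files of one directory land on one NameNode -> locality 1
        System.out.println(locality(Arrays.asList(3, 3, 3, 3)));
        // FileHash: files of the same directory scattered over the cluster -> locality 4
        System.out.println(locality(Arrays.asList(0, 5, 7, 2)));
    }
}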

7.2 Scalability and storage capacity

The scalability of Dr. Hadoop is analyzed with the two real traces shown in Table 2. The growth of metadata (namespace) is studied as increasing amounts of data are uploaded to the cluster in HDFS. These observations are tabulated in Tables 3, 4, and 5 for Hadoop and Dr. Hadoop, with 10 GB of data uploaded on each attempt. The metadata sizes in Table 5 are about 3 times their original sizes for Yahoo and Microsoft because of replication. Figs. 8a and 8b represent the scalability of Hadoop and Dr. Hadoop in terms of load (MB)/NameNode for both datasets. The graphs show a linear increase in the metadata size of Hadoop and Dr. Hadoop. In traditional Hadoop, as the data size in the DataNodes increases, the metadata grows toward the upper bound of the main memory of the single NameNode. So, the maximum data size that the DataNodes can afford is limited by the available memory on the single NameNode server. In Dr. Hadoop, DCMS provides a cluster of NameNodes, which reduces the metadata load per NameNode for the cluster. This results in an enormous increase in the storage capacity of Dr. Hadoop.

Table 3   Incremental data storage vs. namespace size for the traditional Hadoop cluster (Yahoo trace)

Data (GB) | Namespace (MB)
10 | 29.80
20 | 53.13
30 | 77.92
40 | 96.18
50 | 119.02
60 | 142.11
70 | 159.17
80 | 181.67
90 | 203.09
100 | 229.37
110 | 251.04
120 | 279.30
130 | 299.82
140 | 318.33
150 | 337.22
160 | 356.71
170 | 373.18
180 | 394.22
190 | 426.13
200 | 454.76
210 | 481.01
220 | 512.16
230 | 536.92
240 | 558.23
250 | 570.12
256.8 | 594.40



Table 4   Incremental data storage vs. namespace size for the traditional Hadoop cluster (Microsoft trace)

Data (GB) | Namespace (MB)
10 | 20.50
20 | 41.83
30 | 59.03
40 | 77.08
50 | 103.07
60 | 128.18
70 | 141.90
80 | 157.19
90 | 174.34
100 | 190.18
110 | 214.20
120 | 237.43
130 | 254.12
140 | 271.89
150 | 297.12
160 | 312.71
170 | 329.11
180 | 352.12
190 | 369.77
200 | 384.76
210 | 393.04
220 | 401.86
223.7 | 416.02



Table 5   Data storage vs. namespace size for the Dr. Hadoop cluster

Data (GB) | Namespace (MB), Yahoo | Namespace (MB), Microsoft
100 | 684.11 | 572.65
200 | 1364.90 | 1152.89
256.8 | 1789.34 | 1248.02



Due to the scalable nature of DCMS in Dr. Hadoop, each NameNode's load is small in comparison with that of traditional Hadoop. In spite of the three-fold replication of each piece of metadata in DCMS, the load/NameNode of Dr. Hadoop shows a drastic decline compared with Hadoop's NameNode. A lower load on the main memory of a node implies that it is less prone to failure.
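As a back-of-the-envelope check using the totals in Tables 3 and 5 and the 10-NameNode test cluster:

public class LoadPerNameNode {
    public static void main(String[] args) {
        double hadoopNamespaceMB = 594.40;      // single NameNode holds everything (Table 3, Yahoo)
        double drHadoopNamespaceMB = 1789.34;   // about 3x larger due to replication (Table 5, Yahoo)
        int nameNodes = 10;
        System.out.println("Hadoop load/NameNode:     " + hadoopNamespaceMB + " MB");
        System.out.println("Dr. Hadoop load/NameNode: " + (drHadoopNamespaceMB / nameNodes) + " MB");
        // roughly 178.9 MB per NameNode despite three-fold replication
    }
}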

In Fig. 8a, we have also compared the values obtained by the simulator with those from real-time analysis. As the result suggests, the percentage of mismatch is very small and the lines for Hadoop and Hadoop_RT almost coincide. This validates the accuracy of the simulator.

7.3 Experiment on throughput

In Dr. Hadoop, each NameNode of DCMS behaves both as a primary NameNode and as a hot standby node for its left and right NameNodes. During failover, the hot standby takes over the responsibility of metadata lookup. That might affect the throughput of read and write operations. These two metrics are the most important for our model as they dictate the amount of work performed for the clients. We first measured the read/write throughput when there is no failure. Later, the throughput (operations/s) was obtained under NameNode failure to evaluate the fault-tolerance performance.

7.3.1 Read/write throughput

This experiment demonstrates the throughput of metadata operations with no failover. A multithreaded client is configured, which sends metadata operations (read and write) at an appropriate frequency, and the corresponding successful operations are measured. We measure the successful completions of metadata operations per second to compute Dr. Hadoop's efficiency and capacity.

Figs. 9a and 9b show the average read and write throughput in terms of successful completion of operations for Hadoop and Dr. Hadoop for both data traces. Dr. Hadoop's DCMS throughput is significantly higher than that of Hadoop. This validates our claim in Table 1. The experiment is conducted using 10 NameNodes; after a few seconds, Dr. Hadoop's speed shows some reduction, only because of the extra RPCs involved.

Fig. 9   Comparison of throughput under different load conditions: (a) Yahoo trace; (b) Microsoft Windows trace

7.3.2 Throughput against fault tolerance

This part demonstrates the fault tolerance of Dr. Hadoop using DCMS. A client thread is made to send 100 metadata operations (read and write) per second, and the successful operations for Hadoop and Dr. Hadoop are displayed in Figs. 10a and 10b.

Fig. 10   Fault tolerance of Hadoop and Dr. Hadoop: (a) Yahoo trace; (b) Microsoft Windows trace

In Fig. 10a, the experiment is carried out on the Yahoo data trace, where Hadoop initially shows a steady-state throughput. We kill the primary NameNode's daemon at t=100 s, and eventually the whole HDFS goes offline. At around t=130 s, the NameNode is restarted; it recovers itself from the checkpointed state in the secondary NameNode and replays the log operations that it had failed to perform. During the recovery phase, there are a few spikes because the NameNode buffers all the requests until it recovers and then batches them all together.

To test the fault tolerance of DCMS in Dr. Hadoop, out of 10 NameNodes we randomly kill the daemons of a few, viz., NameNode_0 and NameNode_6, at t=40 s. This activity does not lead to any reduction in the successful completion of metadata operations because the neighboring NameNodes act as hot standbys for the killed ones. Again, at t=70 s, we kill NameNode_3 and NameNode_8, leaving 6 NameNodes out of 10, which causes some decline in throughput. At t=110 s, we restart the daemons of two NameNodes and this stabilizes the throughput of Dr. Hadoop.

An experiment with the Microsoft data trace is shown in Fig. 10b. It shows a similar trend to the Yahoo trace for Hadoop. During the test for Dr. Hadoop, we kill two sets of three consecutive NameNodes, NameNode_1 to NameNode_3 and NameNode_6 to NameNode_8, at t=90 s to produce the worst-case scenario (with a probability of almost $10^{-8}$). This makes a portion of the file system unavailable, and the throughput of Dr. Hadoop reaches its poorest value. At t=110 s we restart NameNode_1 to NameNode_3, which increases the throughput. Again, at t=130 s we restart NameNode_6 to NameNode_8, which eventually re-stabilizes the situation.

7.4 Metadata migration in a dynamic situation

In DCMS, addition and deletion of files or directories occur very frequently. As NameNodes join and leave the system, the metadata distribution of Dr. Hadoop changes, and thus DCMS might have to migrate metadata to maintain consistent hashing. DCMS needs to address two questions: the first is how DCMS behaves with respect to the storage load/NameNode when the number of servers is constant; the second is how much key distribution or migration overhead is needed to maintain proper metadata load balancing. The experiment on scalability and storage capacity has answered the first question sufficiently.

Fig. 11 depicts the metadata migration overhead, showing outstanding scalability. We perform the experiment as follows. All the NameNodes in the DCMS system are in an adequate load-balancing state at the beginning, as discussed in Section 4.1. In the experiment, two NameNodes are allowed to join the system at random after a periodic time, and the system reaches a new balanced key distribution state. We examine how many metadata elements are migrated to the newly added servers.

Fig. 11

Fig. 11   Migration overhead


7.5 Hadoop vs. Dr. Hadoop with MapReduce job execution (WordCount)

In this experiment, a MapReduce job is executed to show the results for both frameworks. WordCount, a popular benchmark job, is executed on Hadoop and Dr. Hadoop (Fig. 12).

Fig. 12

Fig. 12   Wordcount job execution time with different dataset sizes


The job is executed on a Wikipedia dataset, with input sizes varying from 4 to 128 GB. In Hadoop, the single NameNode is responsible for the job execution. In contrast, in Dr. Hadoop, the JobTracker (master for DCMS) submits the job to a particular NameNode after mapping the input file path. So, an extra step is involved in the case of Dr. Hadoop, which results in slightly slower execution in comparison with Hadoop.

The overhead of replicating metadata in Dr. Hadoop is also a factor in this small delay. As the locality of metadata is not 100% in Dr. Hadoop, the metadata of subdirectories might be stored in different NameNodes. So, extra RPCs may be needed, which increases the job execution time initially.

However, as the size of the dataset increases beyond 32 GB, the running time is found to almost neutralize all these factors.

During the execution, the rest of the NameNodes are freely available for other jobs. This is not the case for Hadoop, which has a single NameNode. So, it is quite reasonable to neglect the small extra overhead when considering a broader scenario.

7.6 Related work

Various recent studies for decentralization of the NameNode server are discussed in this section.

Providing Hadoop with a highly available metadata server is an urgent and topical problem in the Hadoop research community.

The Apache Software Foundation introduced the secondary NameNode feature in Hadoop (White, 2009), which is purely optional. The secondary NameNode periodically checks the NameNode's namespace status and merges the fsimage with the edit logs. It decreases the restart time of the NameNode. Unfortunately, it is not a hot backup daemon of the NameNode and thus is not capable of hosting DataNodes in the absence of the NameNode. So, it does not resolve the SPOF of Hadoop.

Apache Hadoop's core project team has been working on a backup NameNode that is capable of hosting DataNodes when the NameNode fails. The NameNode can have only one backup NameNode, which continually contacts the primary NameNode for synchronization and hence adds to the complexity of the architecture. According to the project team, serious contributions from different studies are still needed for the backup NameNode.

Wang et al. (2009), with a team from IBM China, worked on the NameNode's high availability by replicating the metadata. Their work is somewhat similar to ours, but their solution consists of three stages: initialization, replication, and failover. This increases the complexity of the whole system by having different subsystems for cluster management and for failure management. In Dr. Hadoop, similar work is done with less overhead and with the same kind of machines.

A major project of Apache Hadoop (HDFS-976) addressed the HA of Hadoop by providing the concept of the Avatar node (HDFS, 2010). The experiment was carried out on a cluster consisting of over 1200 nodes. The fundamental approach of this concept was to switch between primary and standby NameNodes as if switching to an Avatar. For this, the standby Avatar node needs some support from the primary NameNode, which creates an extra load and overhead on the primary NameNode, already exhausted by huge numbers of client requests. This deteriorates the performance of the Hadoop cluster.

ZooKeeper, a subproject of Apache Hadoop, uses replication among a cluster of servers and further provides a leader election algorithm for the coordination mechanism. However, the whole project focuses on coordination of distributed applications rather than an HA solution.

ContextWeb performed an experiment on an HA solution for Cloudera Hadoop. DRBD from LINBIT and Heartbeat from the Linux-HA project were used, but the solution was not optimized for the performance and availability of Hadoop. For example, it replicated much unnecessary data.

E. Sorensen added a hot standby feature to Apache Derby (Bisciglia, 2009). Unfortunately, the solution does not support a multithreading approach to serving replicated messages. Therefore, it cannot be used in the parallel processing of large-scale data in Hadoop.

Okorafor and Patrick (2012) worked on the HA of Hadoop, presenting a mathematical model. The model basically features ZooKeeper leader election. The solution covers only the HA of the Resource-Manager, not the NameNode.

A systematic replica management was provided by Berkeley DB (Yadava, 2007), but it targets only database management systems. If users want to use Berkeley DB replication for an application other than a DBMS, they have to spend a lot of time redesigning the replica framework.

IBM DB2 HADR (high availability disaster recovery) (Torodanhan, 2009) can adjust replication management for efficiency at runtime. The solution uses a simulator to estimate the previous performance of a replica. For the actual scenario of Hadoop, however, this approach is not suitable for tuning the replication process.

8 Conclusions

This paper presents Dr. Hadoop, a modification of the Hadoop framework that provides dynamic circular metadata splitting (DCMS), a dynamic and highly scalable distributed metadata management system. DCMS keeps excellent namespace locality by exploiting locality-preserving hashing to distribute the metadata among multiple NameNodes. Attractive features of DCMS include its simplicity, self-healing, correctness, and good performance when a new NameNode joins the system.

When the size of the NameNode cluster changes, DCMS uses consistent hashing to maintain the balance among the servers. DCMS has an additional replication management approach, which makes it highly reliable in case of a node failure. Dr. Hadoop continues to work correctly in such a case, albeit with degraded performance when consecutive replicated servers fail. Our theoretical analysis of the availability of Dr. Hadoop shows that this framework stays alive 99.99% of the time.

DCMS offers multiple advantages, such as high scalability, efficient balancing of metadata storage, and no bottleneck with negligible additional overhead. We believe that Dr. Hadoop using DCMS will be a valuable component for large-scale distributed applications when metadata management is the most essential part of the whole system.

Reference

Aguilera MK , Chen W , Toueg S . et al. .

Heartbeat: a timeoutfree failure detector for quiescent reliable communication

1997 Proc. 11th Int. Workshop on Distributed Algorithms: 126 - 140 doi: 10.1007/BFb0030680

DOI:10.1007/BFb0030680      URL     [Cited within: 1]

We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show how to solve it using a new failure detector, called heartbeat . In contrast to previous failure detectors that have been used to circumvent impossibility results, the heartbeat failure detector is implementable, and its implementation does not use timeouts. These results have wide applicability: they can be used to transform many existing algorithms that tolerate only process crashes into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k -set agreement, atomic commitment, etc. The heartbeat failure detector is novel: besides being implementable without timeouts, it does not output lists of suspects as typical failure detectors do. If we restrict failure detectors to output only lists of suspects, quiescent reliable communication requires P[2], which is not implementable. Combined with the results of this paper, this shows that traditional failure detectors that output only lists of suspects have fundamental limitations.

Apache Software Foundation, Hot Standby for NameNode, 2012, Available from: http://issues.apache.org/jira/browse/HDFS-976

[Cited within: 1]

Beaver D , Kumar S , Li HC . et al. .

Finding aneedle in haystack: Facebookąŕs photo storage

OSDI, 2010: 47-60

[Cited within: 1]

Biplob D , Sengupta S , Li J . et al. .

FlashStore: high throughput persistent key-value store

2010, Proc. VLDB Endowment: 1414 - 1425

Bisciglia C, 2009. Hadoop HA Configuration. Available from: http://www.cloudera.com/blog/2009/07/22/hadoop-haconfiguration/

Braam RZPJ, 2007. Lustre: a Scalable, High Performance File System. Cluster File Systems, Inc.

Brandt SA, Miller EL, Long DDE, et al., 2003. Efficient metadata management in large distributed storage systems. IEEE Symp. on Mass Storage Systems, p.290-298.

Cao Y, Chen C, Guo F, et al., 2011. ES2: a cloud data storage system for supporting both OLTP and OLAP. Proc. IEEE ICDE, p.291-302.

Corbett PF, Feitelson DG, 1996. The Vesta parallel file system. ACM Trans. Comput. Syst., 14(3):225-264. doi:10.1145/233557.233558

DeCandia G, Hastorun D, Jampani M, et al., 2007. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Oper. Syst. Rev., 41(6):205-220.

Dev D, Patgiri R, 2014. Performance evaluation of HDFS in big data management. Int. Conf. on High Performance Computing and Applications, p.1-7.

Dev D, Patgiri R, 2015. HAR+: archive and metadata distribution! Why not both? ICCCI (in press).

Escriva R, Wong B, Sirer EG, et al., 2012. HyperDex: a distributed, searchable key-value store. ACM SIGCOMM Comput. Commun. Rev., 42(4):25-36. doi:10.1145/2377677.2377681

Howell F, McNab R, 1998. SimJava: a discrete event simulation library for Java. Simul. Ser., 30:51-56.

Ghemawat S, Gobioff H, Leung ST, 2003. The Google file system. Proc. 19th ACM Symp. on Operating Systems Principles, p.29-43.

Haddad IF, 2000. PVFS: a parallel virtual file system for Linux clusters. Linux J., 2000:5.

Wiki, 2012. NameNode Failover. Apache Hadoop Wiki. Available from: http://wiki.apache.org/hadoop/NameNodeFailover

HDFS, 2010. Hadoop AvatarNode High Availability. Available from: http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

Karger D, Lehman E, Leighton F, et al., 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. Proc. 29th Annual ACM Symp. on Theory of Computing, p.654-663.

Kavalanekar S, Worthington BL, Zhang Q, et al., 2008. Characterization of storage workload traces from production Windows servers. Proc. IEEE IISWC, p.119-128.

Lewin D, 1998. Consistent hashing and random trees: algorithms for caching in distributed networks. Master's Thesis, Department of EECS, MIT.

Lim H, Fan B, Andersen DG, et al., 2011. SILT: a memory-efficient, high-performance key-value store. Proc. 23rd ACM Symp. on Operating Systems Principles.

McKusick MK, Quinlan S, 2009. GFS: evolution on fast-forward. ACM Queue, 7(7):10-20. doi:10.1145/1594204.1594206

Miller EL, Katz RH, 1997. RAMA: an easy-to-use, high-performance parallel file system. Parall. Comput., 23(4-5):419-446. doi:10.1016/S0167-8191(97)00008-2

Miller EL, Greenan K, Leung A, 2008. Reliable and efficient metadata storage and indexing using NVRAM. Available from: dcslab.hanyang.ac.kr/nvramos08/EthanMiller.pdf

Nagle D, Serenyi D, Matthews A, 2004. The Panasas ActiveScale storage cluster: delivering scalable high bandwidth storage. Proc. ACM/IEEE SC, p.1-10.

Okorafor E, Patrick MK, 2012. Availability of JobTracker machine in Hadoop/MapReduce ZooKeeper coordinated clusters. Adv. Comput., 3(3):19-30. doi:10.5121/acij.2012.3302

Ousterhout JK, Costa HD, Harrison D, et al., 1985. A trace-driven analysis of the UNIX 4.2 BSD file system. SOSP, p.15-24.

Raicu I, Foster IT, Beckman P, 2011. Making a case for distributed file systems at exascale. Proc. 3rd Int. Workshop on Large-Scale System and Application Performance, p.11-18. doi:10.1145/1996029.1996034

Rodeh O, Teperman A, 2003. zFS: a scalable distributed file system using object disks. IEEE Symp. on Mass Storage Systems, p.207-218.

Satyanarayanan M, Kistler JJ, Kumar P, et al., 1990. Coda: a highly available file system for a distributed workstation environment. IEEE Trans. Comput., 39(4):447-459. doi:10.1109/12.54838

Shvachko K, Kuang HR, Radia S, et al., 2010. The Hadoop Distributed File System. IEEE 26th Symp. on Mass Storage Systems and Technologies, p.1-10.

Torodanhan, 2009. Best Practice: DB2 High Availability Disaster Recovery. Available from: http://www.ibm.com/developerworks/wikis/display/data/Best+Practice+-+DB2+High+Availability+Disaster+Recovery

U.S. Department of Commerce/NIST, 1995. FIPS 180-1: Secure Hash Standard. National Technical Information Service, Springfield, VA.

Wang F, Qiu J, Yang J, 2009. Hadoop high availability through metadata replication. Proc. 1st Int. Workshop on Cloud Data Management, p.37-44. doi:10.1145/1651263.1651271

Weil SA, Pollack KT, Brandt SA, et al., 2004. Dynamic metadata management for petabyte-scale file systems. SC, p.47.

Weil SA, Brandt SA, Miller EL, et al., 2006. Ceph: a scalable, high-performance distributed file system. OSDI, p.307-320.

White T, 2009. Hadoop: the Definitive Guide. O'Reilly Media, Inc.

White BS, Walker M, Humphrey M, et al., 2001. LegionFS: a secure and scalable file system supporting cross-domain high-performance applications. Proc. ACM/IEEE Conf. on Supercomputing, p.59.

Yadava H, 2007. The Berkeley DB Book. Apress.

Zhu Y, Jiang H, Wang J, et al., 2008. HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parall. Distrib. Syst., 19(6):750-763. doi:10.1109/TPDS.2007.70788
