
Distributed File System Implementations and MapReduce strategy

We have already mentioned the MapReduce style of distributed computation for data analysis on computing clusters in the previous post. Here we want to say more about how this strategy is implemented on distributed hardware.

Distributed File Systems (DFS)

Intro

The main idea of distributed file systems is to perform large-scale computations across many machine nodes, exploiting the power of parallel computing. Beyond the gain in performance, the biggest advantage is the system’s tolerance of hardware faults. Here are some prerequisites for the use of a DFS on massive data sets:

  • The data can be enormous, possibly terabytes or even petabytes in size. Small data sets gain little from DFS-style computation.
  • The data should be updated rarely; data that change frequently are not well suited to distributed calculation, because every update has to be propagated across the replicated copies.

Some DFS instances

To date, several distributed file systems are used in practice. Among these are:
1. The Google File System (GFS), the original of the class.
2. Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce, distributed by the Apache Software Foundation.
3. CloudStore, an open-source DFS originally developed by Kosmix.

Some features of DFS implementation

Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated at several compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks* so we don’t lose all copies due to a rack failure. Normally, both the chunk size and the degree of replication can be decided by the user.
To navigate through the chunks of a file, there is another small file called the master node or name node for that file. The master node is itself replicated, and a directory for the file system, as a whole, knows where to find its copies.

*Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch.
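
To make the chunk and replica bookkeeping concrete, here is a small, purely illustrative Python sketch of the kind of directory a master/name node keeps: which chunks make up a file, and which nodes hold a copy of each chunk. All class and field names are invented for this example; this is not the GFS or HDFS API.

```python
import random

class ToyNameNode:
    """Illustrative only: maps files to chunks, and chunks to the nodes holding replicas."""

    def __init__(self, nodes, chunk_size=64 * 2**20, replication=3):
        self.nodes = nodes              # e.g. ["rack1-node1", "rack2-node4", ...]
        self.chunk_size = chunk_size    # 64 MB, the typical size mentioned above
        self.replication = replication  # number of copies kept of each chunk
        self.files = {}                 # file name -> list of chunk IDs
        self.chunks = {}                # chunk ID  -> list of nodes with a replica

    def add_file(self, name, size_bytes):
        n_chunks = -(-size_bytes // self.chunk_size)           # ceiling division
        chunk_ids = [f"{name}#chunk{i}" for i in range(n_chunks)]
        self.files[name] = chunk_ids
        for cid in chunk_ids:
            # A real DFS would place the replicas on different racks on purpose;
            # here we simply pick distinct nodes at random.
            self.chunks[cid] = random.sample(self.nodes, self.replication)
        return chunk_ids

nn = ToyNameNode(nodes=[f"rack{r}-node{n}" for r in range(1, 4) for n in range(1, 9)])
nn.add_file("docs.txt", size_bytes=200 * 2**20)   # about 200 MB -> 4 chunks of 64 MB
print(nn.chunks["docs.txt#chunk0"])               # nodes holding replicas of chunk 0
```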

DFS Implementations

Modern calculations on a DFS are implemented with the MapReduce strategy, one of several approaches in the field. The basic data flow of a MapReduce computation is sketched below. This form of inputs and outputs originates from the desire to allow composition of several MapReduce processes: the output of one process can serve as the input of the next, and the chain may even be cycled:
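
As a stand-in for the original diagram, the following is a minimal, single-machine Python sketch of that flow (map, then group by key, then reduce). The helper name `map_reduce` and its signature are our own, not any particular framework’s API. Because its output is again a list of key-value pairs, it can be fed straight into another such process.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Toy, single-process model of the MapReduce data flow (illustrative only)."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle/group phase: gather all values that share the same key.
    pairs.sort(key=itemgetter(0))
    grouped = ((key, [value for _, value in group])
               for key, group in groupby(pairs, key=itemgetter(0)))
    # Reduce phase: combine the values of each key into one output pair per key.
    return [reduce_fn(key, values) for key, values in grouped]
```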

Example: Counting each word’s occurrences in documents

We want to provide you with an illustration of a map-reduce computation using DFS.

Map

How about counting the number of occurrences of each word in a collection of documents? In this example, the input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are strings (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, . . . , wn. It then issues a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs: (w1, 1), (w2, 1), . . . , (wn, 1). Note that a single Map task will typically process many documents – all the documents in one or more chunks. Thus, its output will be more than the sequence for the one document suggested above. Note also that if a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.
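
Here is a sketch of that Map function, our own illustration in Python rather than any specific framework’s API; the tokenization is deliberately naive:

```python
import re

def word_count_map(document):
    """Emit a (word, 1) pair for every word occurrence in one document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield (word, 1)
```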

Reduce

To collapse those m pairs into a single pair (w, m), we need to apply an associative and commutative operation to the values; here that operation is addition. The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
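
The matching Reduce function is again only a sketch; together with the hypothetical `map_reduce` helper and `word_count_map` sketched earlier, it yields exactly the (w, m) pairs described above:

```python
def word_count_reduce(word, counts):
    """Sum the 1s emitted by the Map phase to get (word, total occurrences)."""
    return (word, sum(counts))

docs = ["the map task reads a document", "the reduce task adds the values"]
print(map_reduce(docs, word_count_map, word_count_reduce))
# [('a', 1), ('adds', 1), ('document', 1), ('map', 1), ('reads', 1),
#  ('reduce', 1), ('task', 2), ('the', 3), ('values', 1)]
```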

For further reference on DFS, read this doc.

Summary

DFS is a modern technique for large-scale, distributed computation, and the Big Data era is pushing hard for exactly this kind of scalable, distributed, fault-tolerant processing. Who knows: within a decade, computing based on MapReduce and other algorithms over DFS may well reshape business operations and culture, and push people to extract value from data at a global scale.
