MapReduce

Summary

In this chapter, we discussed functional programming, then focused on the MapReduce programming model and reviewed how it works internally.

Things to Remember

  • MapReduce is built on the proven concept of divide and conquer.
  • With MapReduce, it is much faster to break a massive task into smaller chunks and process them in parallel.
  • MapReduce works by breaking processing into two phases:
    • map phase
    • reduce phase
  • each phase takes key-value pairs as input and produces key-value pairs as output

MapReduce

Functional Programming

  • It is a programming paradigm.
  • It is a form of declarative programming:
    • programs are built from declarations or expressions instead of statements
    • the problem is decomposed into a set of functions

Functional Programming Languages

Examples include pure functional languages such as Haskell and Hope (an academic language), as well as languages with strong functional support such as Lisp and Mathematica.

An example:

# Not functional: increment mutates shared global state.
a = 0

def increment():
    global a
    a += 1

# Functional: increment has no side effects and returns a new value.
def increment(a):
    return a + 1

Because the functional version shares no state, independent calls to it can safely run in parallel; this is exactly the property MapReduce exploits.

MapReduce

  • MapReduce is a programming model for data processing
  • MapReduce programs are inherently parallel
  • with enough machines at their disposal, they can analyze very large-scale data
  • Hadoop can run MapReduce programs written in various languages

fig:- MapReduce Solution

Map and Reduce

  • MapReduce works by breaking processing into two phases:
    • map phase
    • reduce phase
  • each phase takes key-value pairs as input and produces key-value pairs as output
  • the programmer specifies one function (method) for each phase: the map function and the reduce function, as in the sketch below
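
As a minimal, hypothetical sketch (Hadoop itself exposes a Java API; these Python names and signatures are illustrative only), word count can be written as one small function per phase:

def map_fn(key, line):
    # Map phase: input pair (file offset, line of text) -> list of (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: input pair (word, list of counts) -> output pair (word, total).
    return (word, sum(counts))

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(reduce_fn("to", [1, 1]))
# ('to', 2)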

Data Flow

  • a 'MapReduce job' consists of the input data, the MapReduce program, and configuration information
  • Hadoop divides a job into tasks: 'map tasks' and 'reduce tasks'
  • two types of nodes control the job execution process:
    • a job tracker
    • a number of task trackers
  • the job tracker schedules tasks to run on task trackers
  • task trackers run the tasks and send progress reports to the job tracker
  • the job tracker keeps a record of the overall progress of each job; failed tasks are rescheduled by the job tracker on a different task tracker

  • Hadoop divides the input to a MapReduce job into fixed-size pieces called 'input splits'
  • Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

The splits can then be processed in parallel, as sketched below. For good load balancing, splits should be fine-grained, but if splits are too small, the overhead of creating and managing them starts to dominate.
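
As an illustration of the idea (this is plain Python multiprocessing, not Hadoop's machinery; all names here are hypothetical), the sketch below cuts an input into fixed-size splits and runs one word-count task per split in parallel:

from multiprocessing import Pool

def count_words(split):
    # One "map task": count words in a single input split.
    counts = {}
    for word in split:
        counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    words = ("the quick brown fox jumps over the lazy dog " * 8).split()
    # Divide the input into fixed-size splits of 18 words each.
    size = len(words) // 4
    splits = [words[i:i + size] for i in range(0, len(words), size)]
    with Pool(4) as pool:
        # Each split is processed in parallel by its own worker.
        partial_counts = pool.map(count_words, splits)
    # Merge the per-split results into the final word counts.
    total = {}
    for part in partial_counts:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    print(total)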

fig:- MapReduce Data Flow with a Single Reduce Task

fig:- MapReduce Data Flow with No Reduce Task

  • the input and final output are stored in the distributed file system; the job tracker tries to schedule each map task close to the physical storage location of its input data
  • intermediate results are stored on the local file system of the map or reduce tasks; the output of one MapReduce job is often the input to another

Combiner

A combiner runs between the map and reduce phases to reduce the volume of data transferred from the map tasks to the reduce tasks.

The output of a map task is usually large, so the amount of data shuffled to the reduce tasks is high; a combiner aggregates map output locally before it crosses the network, as in the sketch below.
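
A minimal sketch of the effect in Python (map_words and combine are hypothetical names; in Hadoop the combiner is often the same class as the reducer):

from collections import Counter

def map_words(line):
    # Raw map output: one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: sum counts per key locally on the map side,
    # so fewer pairs have to be shuffled to the reduce tasks.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

mapped = map_words("to be or not to be")
print(mapped)           # six pairs leave the map function...
print(combine(mapped))  # ...but only four pairs cross the network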

fig:- Combiner Phase

Factors affecting the performance of MapReduce:

  • Hardware (or resources): CPU clock speed, disk I/O, network bandwidth, and memory size.
  • Storage independence: the underlying storage system.
  • MapReduce background: the sizes of the input, shuffle, and output data, which are closely correlated with the runtime of a job.
  • Programming model: the job algorithms (map, reduce, partition, combine, and compress). Some algorithms are hard to conceptualize in MapReduce, or inefficient to express in its terms.
  • Scheduling: the cost of runtime scheduling.

Another way to look at MapReduce is as a five-step parallel and distributed computation:

  1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
  2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
  3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
  4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
  5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected.
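
The whole model can be imitated in a few lines of Python; this is a single-process sketch of the five steps, not a distributed implementation, and run_mapreduce is a hypothetical helper:

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Steps 1-2: run the user-provided map function once per input record.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Step 3: "shuffle" - group all map output by its key (K2).
    pairs.sort(key=itemgetter(0))
    groups = groupby(pairs, key=itemgetter(0))
    # Steps 4-5: run reduce once per K2 key and collect the sorted output.
    return [reduce_fn(k, [v for _, v in g]) for k, g in groups]

# Example: maximum temperature observed per year.
readings = [("1950", 22), ("1950", 0), ("1949", 111), ("1949", 78)]
result = run_mapreduce(
    readings,
    map_fn=lambda r: [r],                         # emit (year, temperature)
    reduce_fn=lambda year, temps: (year, max(temps)),
)
print(result)  # [('1949', 111), ('1950', 22)]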

