Map Reduce
Summary
In this chapter, we discussed functional programming, then focused on the MapReduce programming model and reviewed how it works internally.
Things to Remember
- MapReduce is built on the proven concept of divide and conquer.
- With MapReduce, it is much faster to break a massive task into smaller chunks and process them in parallel.
- MapReduce works by breaking processing into two phases:
- map phase
- reduce phase
- Each phase has key-value pairs as input and output.

Map Reduce
Functional Programming
- It is a programming paradigm.
- It is declarative programming:
- programs are written with declarations or expressions instead of statements
- the problem is decomposed into a set of functions
Functional Programming Languages
Pure functional languages include Hope (academia), Lisp, Haskell, Mathematica, etc.
An example of the functional style is sketched below.
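A minimal sketch in Python (an illustration assumed here, not taken from the original notes): the sum of the squares of the even numbers in a list, expressed by composing functions rather than writing a loop of statements.

    from functools import reduce

    numbers = [1, 2, 3, 4, 5, 6]

    evens = filter(lambda n: n % 2 == 0, numbers)        # declare what to keep
    squares = map(lambda n: n * n, evens)                # declare the transformation
    total = reduce(lambda acc, n: acc + n, squares, 0)   # fold the results into one value

    print(total)  # 56  (2*2 + 4*4 + 6*6)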
MapReduce
- MapReduce is a programming model for data processing
- MapReduce programs are inherently parallel
- with enough machines at their disposal, MapReduce programs can analyze very large-scale data
- Hadoop can run MapReduce programs written in various languages

Map and Reduce
- MapReduce works by breaking processing into two phases:
- map phase
- reduce phase
- each phase has key-value pairs as input and output
- the programmer also specifies one function (method) for each phase: the map function and the reduce function (a small sketch of both appears after this list)
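As a rough single-machine sketch in Python (the function names and record layout are assumptions for illustration, not Hadoop's API): a word-count map function turns each input record into intermediate (word, 1) pairs, and the reduce function sums all the counts emitted for one word.

    # hypothetical word count: each phase consumes and produces key-value pairs
    def word_count_map(key, value):
        # map phase: key = line offset, value = line of text; emits (word, 1) pairs
        for word in value.split():
            yield (word.lower(), 1)

    def word_count_reduce(key, values):
        # reduce phase: key = word, values = every count emitted for that word
        yield (key, sum(values))

    print(list(word_count_map(0, "the quick brown Fox")))
    # [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1)]
    print(list(word_count_reduce("the", [1, 1, 1])))
    # [('the', 3)]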
Data flow
- a 'MapReduce job' consists of the input data, the MapReduce program, and configuration information
- a Hadoop job is divided into tasks: 'map tasks' and 'reduce tasks'
- two types of nodes control the job execution process:
- a job tracker
- a number of task trackers
- the job tracker schedules tasks to run on task trackers
- task trackers run tasks and send progress reports to the job tracker
- the job tracker keeps a record of the overall progress of each job; failed tasks are rescheduled on a different task tracker by the job tracker
- Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits
- Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
The splits can then be processed in parallel. For good load balancing, splits should be fine-grained, but if splits are too small the overhead of managing them increases.


- the input and the final output are stored in the distributed file system; the job tracker tries to schedule each map task close to the physical storage location of its input data
- intermediate results are stored on the local file system of the map and reduce tasks; the output of one MapReduce job is often the input to another MapReduce job
Combiner
The Combiner class is used between the Map and Reduce classes to reduce the volume of data transferred from the map tasks to the reduce tasks.
Usually the output of a map task is large, so the amount of data transferred to the reduce tasks is high; the combiner aggregates each map task's output locally before it crosses the network, as sketched below.
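A rough Python sketch of the idea (the data and the combine function are assumptions for illustration, not Hadoop's Combiner API): the combiner applies the same aggregation as the reducer, but only to one map task's local output, so fewer pairs are shuffled.

    from collections import defaultdict

    # hypothetical local output of a single map task
    map_output = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ("the", 1)]

    def combine(pairs):
        # sum the counts per word locally, before the shuffle
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return list(totals.items())

    print(combine(map_output))
    # [('the', 3), ('quick', 1), ('fox', 1)]  -- 3 pairs cross the network instead of 5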

Factors affecting the performance of MapReduce:
- Hardware (or resources) such as CPU clock, disk I/O, network bandwidth, and memory size.
- Storage independence: the underlying storage system from which input is read and to which output is written.
- Data size: the sizes of the input data, shuffle data, and output data, which are closely correlated with the runtime of a job.
- Programming Model: Job algorithms (or program) such as map, reduce, partition, combine, and compress. Some algorithms may be hard to conceptualize in MapReduce, or may be inefficient to express in terms of MapReduce.
- Scheduling: the impact of the runtime scheduling cost.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
- Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
- Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
- "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
- Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
- Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected.
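A minimal single-process sketch of the five steps in Python (an assumed simulation for illustration; a real MapReduce system distributes this work across many machines and handles partitioning, scheduling, and failures):

    from collections import defaultdict

    def map_fn(k1, v1):
        # step 2: user-provided Map(), run once per K1 key, emitting (K2, value) pairs
        for word in v1.split():
            yield (word.lower(), 1)

    def reduce_fn(k2, values):
        # step 4: user-provided Reduce(), run once per K2 key
        return (k2, sum(values))

    # step 1: prepare the Map() input -- one record per K1 key (here, a line offset)
    inputs = {0: "the quick brown fox", 1: "the lazy dog", 2: "the fox"}

    # step 3: "shuffle" -- group every map output value by its K2 key
    groups = defaultdict(list)
    for k1, v1 in inputs.items():
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)

    # step 5: collect the Reduce() output and sort it by K2
    final = sorted(reduce_fn(k2, values) for k2, values in groups.items())
    print(final)
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]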