Hadoop

Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project built and used by a global community of contributors and users. The Hadoop framework is composed of Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.


Things to Remember

MapReduce underwent a complete overhaul in hadoop-0.23, resulting in what is now called MapReduce 2.0 (MRv2), or YARN.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.


Hadoop

Introduction

Hadoop is an open source software framework that enables the distributed processing of large data sets across clusters of commodity servers. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

History

With the growth of the World Wide Web in the late 1990s and early 2000s, many indexes and search engines were created to locate information and text-based content. This approach became inefficient as the web grew from dozens to millions of pages, so automation was needed. One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept: storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.

fig:- Hadoop Timeline

The Nutch project was divided: the web crawler portion remained as Nutch, and the distributed computing and processing portion became Hadoop (named after Cutting's son's toy elephant). Today, Hadoop's framework and ecosystem are maintained by the Apache Software Foundation, a non-profit global community of software developers and contributors.

Features that make Hadoop important:

  • Speed: stores and processes huge volumes of data quickly
  • Computing power: a distributed computing model
  • Fault tolerance: redundancy and high availability
  • Flexibility: schema-less; can absorb any kind of data
  • Low cost: a free, open-source framework running on commodity hardware
  • Scalability

Hadoop Daemons
Hadoop (version 1) consists of five daemons:

  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

“Running Hadoop” means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers.


NameNode

  • Hadoop employs a master/slave architecture for both distributed storage and distributed computation.
    The distributed storage system is called the Hadoop Distributed File System, or HDFS.
  • The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks.
  • The NameNode is the bookkeeper of HDFS; it keeps track of how files are broken down into blocks, which nodes store those blocks, and the overall health of the distributed file system.
  • The NameNode's function is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, to lower the workload on the machine.
  • It is a single point of failure of the Hadoop cluster.
  • For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or can be quickly restarted. Not so for the NameNode.



DataNode

  • Each slave machine in the cluster hosts a DataNode daemon to perform the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system.
  • When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode each block resides in.
  • Your job communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
    Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

Secondary NameNode

  • The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS.
  • Like the NameNode, each cluster has one SNN, and it typically resides on its own machine.
  • It communicates with the NameNode to take periodic snapshots of the HDFS metadata, merging the edit log into the file-system image at intervals defined by the cluster configuration.
  • The SNN is not a hot standby: recovering from a NameNode failure requires manual intervention to reconfigure the cluster to use the SNN's checkpoint.


JobTracker

  • The JobTracker daemon is the liaison (mediator) between your application and Hadoop.
  • Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they run.
  • Should a task fail, the JobTracker automatically relaunches the task, possibly on a different node, up to a predefined limit of retries.
  • There is only one JobTracker daemon per Hadoop cluster.
  • It typically runs on a server acting as a master node of the cluster.

The core of Apache Hadoop consists of HDFS (the storage part) and MapReduce (the processing part).

The base of the Apache Hadoop framework consists of:

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop MapReduce
  • Hadoop YARN (Yet Another Resource Negotiator, introduced in Hadoop 2)

Other components such as Pig, Hive, HBase, Phoenix, Spark, and ZooKeeper sit on top of this base to form an ecosystem.

fig:- Hadoop Ecosystem

Hadoop HDFS (for data storage)

HDFS is a distributed file system for redundant storage. It is designed to reliably store data on commodity hardware; one of its defining features is that it is built to expect hardware failures. It links together the file systems on many local nodes to make them into one big file system.

fig:- HDFS Architecture
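
To make the read/write path concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration, not settings of any particular cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; the NameNode decides which DataNodes hold its blocks.
            Path path = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("hello, HDFS\n");
            }

            // Read it back; the client streams block data directly from the DataNodes.
            try (BufferedReader in =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }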


Hadoop MapReduce (for data processing)

MapReduce is a programming model for distributed computations at massive scale, and an execution framework for organizing and performing such computations.

fig:- MapReduce Flow

In a typical large-data problem, MapReduce does the following:

map()

  • Iterate over a large number of records
  • Extract something of interest from each record

The framework then shuffles and sorts the intermediate results between the two phases.

reduce()

  • Aggregate the intermediate results
  • Generate the final output
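
The classic word-count job shows both halves in one program. Below is a minimal sketch using Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce package); the input and output directories are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map(): iterate over input records, emit (word, 1) for each word found
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);  // extract something of interest
                }
            }
        }

        // reduce(): aggregate the shuffled-and-sorted counts for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);    // generate final output
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Setting the reducer class as the combiner lets partial sums be computed on the map side before the shuffle, which cuts network traffic.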

Apache Pig (for scripting in Hadoop)

Apache Pig is a high-level data-flow language and execution framework. It has two components:

  1. Pig Latin : the data processing language
  2. Compiler : translates Pig Latin into MapReduce jobs

Pig abstracts away the low-level details of MapReduce and allows the user to focus on data processing.

fig:- Pig Execution Diagram
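
Pig Latin can be run from scripts or the Grunt shell, and it can also be embedded in Java through the PigServer class. The following is a minimal word-count sketch in embedded mode; the input path /data/input.txt and the output path are assumptions for illustration.

    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // "mapreduce" mode compiles the plan into cluster jobs; "local" runs in-process.
            PigServer pig = new PigServer("mapreduce");

            // Pig Latin statements; the compiler translates them into MapReduce jobs.
            pig.registerQuery("lines = LOAD '/data/input.txt' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            // Storing (or dumping) an alias is what triggers actual execution.
            pig.store("counts", "/data/word_counts");
            pig.shutdown();
        }
    }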

Apache Hive (for queries in Hadoop)

Hive is a data-warehousing layer on top of Hadoop. It allows data to be analyzed and queried using an SQL-like language (HiveQL).

Hive is best for data analysts familiar with SQL who need to do ad-hoc queries, summarization, and data analysis.

fig:- Hive Architecture
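
As a sketch of how such a query might be issued from a client program, the example below goes through the Hive JDBC driver to HiveServer2. The host, port, and the page_views table with its columns are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2's JDBC driver; host, port, and database are illustrative.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");

            try (Statement stmt = conn.createStatement()) {
                // An SQL-like (HiveQL) query; Hive compiles it into jobs on the cluster.
                // The table 'page_views' and its columns are hypothetical.
                ResultSet rs = stmt.executeQuery(
                        "SELECT country, COUNT(*) AS views " +
                        "FROM page_views GROUP BY country ORDER BY views DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }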

ZooKeeper (for coordination)

  • A highly available, scalable service for distributed configuration, consensus, group membership, leader election, naming, and coordination
  • Cluster management
  • Load balancing
  • JMX monitoring
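
ZooKeeper exposes these services through a small tree of nodes called znodes. The sketch below, using the standard ZooKeeper Java client, registers a worker as an ephemeral sequential znode (which disappears automatically if the client dies, the building block of group membership and leader election) and lists the current members. The connection string and the /workers path are assumptions for illustration.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect to a ZooKeeper ensemble (the address is illustrative).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Ensure the parent znode exists (a sketch; ignores creation races).
            if (zk.exists("/workers", false) == null) {
                zk.create("/workers", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // An ephemeral znode vanishes when this session ends -- the basis
            // for group membership and leader election.
            String path = zk.create("/workers/worker-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("Registered as " + path);

            // List the current group members.
            List<String> members = zk.getChildren("/workers", false);
            System.out.println("Members: " + members);
            zk.close();
        }
    }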

Limitations of Hadoop System

Security concerns:

Hadoop's complexity can compromise its security. It also lacks encryption at the storage and network levels.

Vulnerable by nature:

Java is a language heavily exploited by cyber criminals. Being built entirely on Java, Hadoop inherits that attack surface and has suffered numerous security breaches.

Not fit for small data:

Hadoop is not efficient at randomly reading large numbers of small files.

Potential stability issues:

Hadoop is an open-source project, so users are advised to make sure they run the latest stable version.

General limitations:

Hadoop does not have all the answers when it comes to big data. Apache Flume, MillWheel, and Google's Cloud Dataflow can provide possible alternatives or complements.
