Google File System Overview

Summary

The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to Google's setting, many apply to data processing tasks of a similar magnitude and cost consciousness.

Things to Remember

Google File System (GFS) was developed as a scalable distributed file system to serve data-intensive applications.

The GFS design offers many insights and ideas to the broader database and storage research community.

Component failure is the norm rather than the exception.

Files stored in the system are usually large (multi-GB files are the common case).

Workloads consist of large streaming reads and small random reads.

Writes are mostly large and sequential, appending data to files.

The system must efficiently support many clients reading and writing in parallel.

High sustained bandwidth is more important than low latency.
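
The last point is easiest to see with rough disk arithmetic. The sketch below uses assumed commodity-disk numbers (about 100 MB/s sequential throughput and 10 ms per random seek), not figures from the paper:

```python
# Rough disk arithmetic behind "bandwidth over latency".
# Hardware numbers are assumptions chosen for illustration:
# ~100 MB/s sequential throughput, ~10 ms per random seek.

KB, MB, GB = 2**10, 2**20, 2**30

data = 1 * GB
seq_seconds = data / (100 * MB)            # one long streaming read
rand_seconds = (data // (4 * KB)) * 0.010  # 4 KB random reads, one seek each

print(f"sequential read of 1 GB:   {seq_seconds:8.1f} s")   # ~10 s
print(f"random 4 KB reads of 1 GB: {rand_seconds:8.1f} s")  # ~2621 s
# Large streaming reads keep disks at full bandwidth, which is why GFS
# optimizes for sustained bandwidth rather than per-operation latency.
```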


Google File System Overview

Drawbacks of simple NFS (Network File System):

  • built on RPCs
  • low performance
  • security issues

Google File System (GoogleFS or GFS) is a proprietary distributed file system developed by Google for its own use. It is designed to provide efficient and reliable access to data using large clusters of commodity hardware [1].

GFS characteristics

  • Files in GFS are organised hierarchically in directories and are identified by path-names.
  • Supports the usual operations to create, delete, open, close, read, and write files.
  • Small as well as multi-GB files are common.
  • Each file typically contains many application objects such as web documents.
  • In a traditional write, the client specifies the offset at which data is to be written. GFS additionally provides an atomic append operation known as record append.
  • Concurrent writes to the same region are not serializable.
  • GFS provides two special operations:
    • Snapshot: creates a copy of a file or a directory tree at low cost.
    • Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. [3]
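
The record append semantics are easier to see in code. The sketch below is a toy, in-memory stand-in (the real GFS client API is not public): as in GFS, the system rather than the client chooses the append offset, so each record lands whole even under concurrency.

```python
import threading

# A toy, in-memory stand-in for a GFS file. This hypothetical
# record_append() only illustrates the semantics: the system, not the
# client, chooses the offset, and each record is written atomically.

class ToyGFSFile:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = bytearray()

    def record_append(self, record: bytes) -> int:
        with self._lock:               # stands in for GFS's replica ordering
            offset = len(self._data)
            self._data.extend(record)  # the record is written as one unit
            return offset              # GFS tells the client where it landed

f = ToyGFSFile()

def producer(tag):
    for i in range(3):
        f.record_append(f"{tag}{i};".encode())

threads = [threading.Thread(target=producer, args=(t,)) for t in "AB"]
for t in threads: t.start()
for t in threads: t.join()

# Records from A and B may interleave with each other, but no record is
# ever torn, e.g. b'A0;B0;A1;A2;B1;B2;'
print(bytes(f._data))
```

Note how this differs from a traditional write: clients never race to pick offsets, so concurrent appenders need no application-level locking.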

Common goals of GFS

  1. Performance
  2. Reliability
  3. Automation
  4. Fault Tolerance
  5. Scalability
  6. Availability [2]

The choices in the design space that make GFS superior to traditional systems are:

  • Component failures are the norm rather than the exception.
    • The file system consists of hundreds or even thousands of storage machines built from cheap commodity parts. Since the system is built from inexpensive components, the quantity and quality of the parts virtually guarantee that some are not functioning at any given time and some will not recover from their current failures. Thus, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
  • Files are huge. Multi-GB files are common.
    • Each file typically contains many application objects such as web documents. Handling multi-GB and multi-TB files makes it necessary to revisit design assumptions and parameters such as I/O operation sizes and block size (see the block-size sketch below).
  • Append, Append, Append.
    • Most files are mutated by appending new data rather than overwriting existing data; random writes within a file are practically non-existent. On huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
  • Co-Designing
    • Co-designing the applications and the file system API benefits the overall system by increasing flexibility. [3]

These principles actually highlight the characteristics of ‘big data’ and its applications.
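
As a concrete illustration of the block-size point above, here is a minimal Python sketch. It assumes the 64 MB chunk size used in the GFS paper [1]; the 4 KB figure is a typical local file system block size, chosen for contrast.

```python
# Metadata pressure of huge files: 64 MB GFS-style chunks versus a
# typical 4 KB local file system block.

KB, MB, TB = 2**10, 2**20, 2**40

file_size = 1 * TB
for block in (4 * KB, 64 * MB):
    print(f"block size {block // KB:>6} KB -> "
          f"{file_size // block:>11,} blocks per 1 TB file")
# 4 KB blocks:  268,435,456 records to track for a single file.
# 64 MB chunks:      16,384 records, small enough that all file
# metadata can live in one master's memory.
```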

Why assume hardware failure is normal?

  • It is cheaper to assume common failure on poor hardware and account for it, rather than invest in expensive hardware and still experience occasional failure.
  • The many layers in a distributed system (network, memory, disk, application, OS, physical connections, power) mean that a failure in any one of them can contribute to data corruption.
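
A back-of-the-envelope calculation makes this concrete. The 99.9% daily per-machine survival rate below is an assumed, optimistic figure, not one from the paper:

```python
# Why failure is "the norm" at scale: even with 99.9% per-machine daily
# reliability, a large cluster sees a failure on most days.

p_ok = 0.999
for n in (10, 100, 1000):
    print(f"{n:>5} machines: P(at least one failure today) = {1 - p_ok**n:.1%}")
#    10 machines:  1.0%
#   100 machines:  9.5%
#  1000 machines: 63.2%  -> recovery must be automatic, not an emergency
```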

GFS has some limitations:

  • lack of support for POSIX features
  • high storage overhead
  • specialization for record appends rather than small random writes

Small-scale users might be better served by a traditional POSIX file system. For data that is usually appended to rather than overwritten, and where commodity disks are cheap enough to buy at 3x coverage, GFS is a useful tool for working at large scale.
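
The "3x coverage" above refers to triple replication, the default replication level in the GFS paper [1]. A rough sketch of the cost trade-off, using an assumed round-number disk price:

```python
# Storage overhead of triple replication. The disk price is an assumed
# round number for illustration, not a quoted figure.

logical_tb = 1000          # 1 PB of user data
replicas = 3
usd_per_tb = 20            # assumed commodity disk cost

raw_tb = logical_tb * replicas
print(f"{logical_tb:,} TB logical -> {raw_tb:,} TB raw disk, "
      f"~${raw_tb * usd_per_tb:,} in disks")
# 200% storage overhead buys the ability to treat the loss of any single
# disk or machine as a routine, automatically repaired event.
```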

fig: GFS

References:

  1. "The Google File System",Google Research Publication, Howard Gobioffm Sanjay Ghemawat, and Shun-Tak Leung, 2003
  2. "Distributed File System", LLC

