🔍 The Brain Behind GFS: Master, Integrity, and Legacy | Google File System Part 4

2025-06-29

Intro

Welcome to the final chapter of our deep dive into the Google File System.

So far, we’ve explored why GFS was built, its architecture, and how chunk servers handle massive data reliably. In this part, we’ll focus on the Master — the brain of the system — and how it manages metadata, maintains high availability, and ensures system recovery.

We’ll also look at:

  • How GFS handles deletions and rebalancing
  • How data integrity is preserved using checksums
  • Systems inspired by GFS, like HDFS and Colossus

🔧 The Master: The Brain of GFS

Let’s now talk about the Master, the central control node of the Google File System. Think of it as the head librarian in charge of tracking all books (files), their chapters (chunks), and where those books are placed across shelves (chunk servers) in a giant digital library.

But what exactly does the Master store?

🗂️ Master’s Metadata Responsibilities

The Master keeps track of three critical types of metadata:

  • File and directory hierarchy (the namespace) — like a directory tree on your computer
  • File-to-chunk mapping — which file is broken into which chunks
  • Chunk replica locations — which chunk servers are holding each chunk replica
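To make these three metadata types concrete, here is a minimal sketch in Python. All names are illustrative, not taken from the GFS codebase:

```python
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    """Illustrative model of the Master's three metadata maps."""
    # 1. Namespace: full path -> file attributes (persisted on disk)
    namespace: dict = field(default_factory=dict)
    # 2. File-to-chunk mapping: path -> ordered chunk handles (persisted)
    file_chunks: dict = field(default_factory=dict)
    # 3. Chunk replica locations: handle -> chunkserver ids (memory only)
    chunk_locations: dict = field(default_factory=dict)

meta = MasterMetadata()
meta.namespace["/logs/web-001"] = {"size_chunks": 2}
meta.file_chunks["/logs/web-001"] = ["chunk-a", "chunk-b"]
meta.chunk_locations["chunk-a"] = {"cs1", "cs2", "cs3"}
```

Note that only the first two maps would survive a crash; the third is rebuilt, as discussed next.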

Now let’s discuss how and where this metadata is stored, because this is where GFS makes clever trade-offs.

🧠 In-Memory vs Persistent Storage — A Trade-Off

You might assume all of this must be saved in a database or stored persistently on disk. But GFS intentionally stores only part of it persistently.

✅ Persistently Stored (on disk):

  • File namespace
  • File-to-chunk mapping

⚡ Stored in Memory (volatile):

  • Chunk replica locations

Why store chunk locations in memory only? Isn’t that risky?

That’s a valid concern: yes, if the Master crashes, it loses this in-memory data.

But GFS intentionally chooses this design for the following reasons:

💡 Reason 1: Chunk Servers Are the Source of Truth

Instead of treating the Master as the absolute source of all information, GFS delegates the ownership of chunks to the chunk servers.

On Master startup, it simply reaches out to each chunk server and says:

“Hey, tell me what chunks you have.”

This is called the chunk report. Within seconds, the Master rebuilds its in-memory state.

🔄 Reason 2: Data is Continuously Refreshed

The Master regularly updates its in-memory chunk map via:

  • Startup polling: Queries all chunk servers for their current chunk lists.
  • Heartbeat messages: Every chunk server sends periodic health signals, including chunk lists.
  • During client operations: As clients read/write files, chunk locations are dynamically updated.

This makes the chunk location metadata fresh, low-latency, and resilient, even without disk persistence.
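The startup rebuild can be sketched as a toy model (the `ChunkServer` class and its `report()` method are invented for illustration, not real GFS APIs):

```python
class ChunkServer:
    def __init__(self, server_id, chunks):
        self.server_id = server_id
        self.chunks = set(chunks)

    def report(self):
        """Chunk report: the server tells the Master what chunks it holds."""
        return self.server_id, self.chunks

def rebuild_chunk_locations(chunk_servers):
    """On startup, the Master polls every chunkserver and rebuilds its
    in-memory chunk -> {servers} map from their reports."""
    locations = {}
    for cs in chunk_servers:
        server_id, chunks = cs.report()
        for handle in chunks:
            locations.setdefault(handle, set()).add(server_id)
    return locations

servers = [
    ChunkServer("cs1", ["c1", "c2"]),
    ChunkServer("cs2", ["c2", "c3"]),
    ChunkServer("cs3", ["c1", "c3"]),
]
locations = rebuild_chunk_locations(servers)
# Each chunk ends up mapped to the set of servers holding a replica,
# e.g. locations["c2"] == {"cs1", "cs2"}
```

Heartbeats would update the same map incrementally; the point is that the map is always derivable from what the chunkservers themselves report.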

⚖️ Trade-Off: Speed vs Durability

Storing chunk locations persistently

  • ❌ Not done in GFS
  • Reason: Too slow, and the data would often be stale by the time it was read

Storing chunk locations in memory

  • ✅ What GFS does
  • Reason: Very fast access, and easily reconstructed on startup

This trade-off improves system latency and scalability while trusting the chunk servers to report honestly, which is acceptable in a trusted datacenter environment.

Think of the Master as a head librarian who keeps a notebook listing which assistant is responsible for which section of the library. Instead of asking assistants on every book request, she maintains her notebook by periodically checking in with them through regular status reports or when a new request comes in that might need an update.

If the librarian loses her notebook (e.g., a crash), she can simply walk around the library, ask each assistant what books they’re holding, and quickly rebuild it. This makes the system both fast and resilient.

💥 What Happens on Master Failure?

If the Master crashes:

  • The new Master instance reads the persistent metadata (file structure and chunk mappings) from disk.
  • It rebuilds the chunk location map by polling all chunk servers.
  • Within a few seconds, the system is back to full operation.

This allows for fast recovery without a need for complex consensus protocols (like Paxos or Raft), which are typically required for keeping replicated metadata in sync.

🔁 Summary of the Master’s Design Philosophy

  • Metadata that must be durable (like file structure) → stored persistently via operation logs
  • Metadata that is volatile but reconstructable (chunk locations) → stored in memory only
  • Fast recovery relies on the fact that chunk servers are the source of truth

This separation makes the Master lightweight, fast, and easy to failover — all while keeping the system scalable and robust.

💾 Where is the Operation Log Stored?

Every metadata change is first recorded in an append-only operation log. So where does this log live?

  • The log is stored locally on the Master machine.
  • It’s also replicated to multiple remote machines, ensuring that even if the Master crashes or its disk is lost, another machine has the complete operation history.

💡 Only after the log is safely written and replicated does the Master acknowledge a client’s request.

This ensures the system can always recover from crashes — no operation is ever “half-done.”

⏩ Checkpoints: Fast Forwarding Recovery

While the operation log ensures durability, there’s a problem:

Over time, the log can grow very large. Replaying millions of operations during recovery can take too long.

To fix this, GFS introduces checkpoints.

🔹 What is a Checkpoint?

A checkpoint is a snapshot of the Master’s current state, like a save point in a video game. It includes:

  • The full namespace (directories and files)
  • The current file-to-chunk mapping

These checkpoints are stored in a compact B-tree format on disk.

🔁 How Do Recovery and Boot-Up Work?

Let’s walk through a Master recovery process:

  • Load the latest checkpoint from disk
  • Replay all operations in the operation log that happened after the checkpoint
  • Rebuild in-memory chunk location map (as covered earlier)
  • The Master is now fully restored!
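The checkpoint-plus-log portion of this recovery can be sketched as follows (the state and operation formats are invented for illustration):

```python
def apply(state, op):
    """Apply one logged metadata operation to the namespace state."""
    kind, path = op
    if kind == "create":
        state[path] = {}
    elif kind == "delete":
        state.pop(path, None)
    return state

def recover(checkpoint, log, checkpoint_seq):
    """Load the latest checkpoint, then replay only the log entries
    recorded after it. The chunk location map is rebuilt separately
    by polling chunkservers."""
    state = dict(checkpoint)
    for seq, op in log:
        if seq > checkpoint_seq:
            apply(state, op)
    return state

checkpoint = {"/a": {}, "/b": {}}            # snapshot taken at seq 2
log = [(1, ("create", "/a")), (2, ("create", "/b")),
       (3, ("create", "/c")), (4, ("delete", "/a"))]
state = recover(checkpoint, log, checkpoint_seq=2)
# Only ops 3 and 4 are replayed: state == {"/b": {}, "/c": {}}
```

The key property: recovery cost is proportional to the log written since the last checkpoint, not to the full history.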

This hybrid of checkpoints + logs provides a balance of:

  • Fast recovery (don’t need to replay the entire log)
  • Low overhead (no need to checkpoint too frequently)

🧠 Analogy: A Librarian’s Diary + Save Points

Imagine the Master as a head librarian maintaining a diary of every change made to the library:

  • “Created new shelf in Science section”
  • “Moved Book X to Shelf Y”
  • “Deleted old magazine stack”

To speed up the next librarian’s job, she also creates full snapshots of the library layout every few days (checkpoints).

If the librarian leaves or loses her diary, the new one can:

  • Load the last saved layout
  • Replay changes from the diary since then
  • And the library is back in shape!

⚖️ Trade-Off: Logs vs Snapshots

📝 Operation Log

  • Frequency: Recorded after every metadata-changing operation
  • Storage Overhead: Small, since it’s append-only
  • Recovery Time: Slower if the log grows large (requires replaying all operations since the last checkpoint)
  • Best Use: Ideal for durability and maintaining the exact operation order

📍 Checkpoint

  • Frequency: Created periodically (e.g., every few hours)
  • Storage Overhead: Higher, since it stores the entire system state
  • Recovery Time: Much faster — only need to replay log entries after the last checkpoint
  • Best Use: Enables quick recovery in case of Master failure

GFS combines the two for both durability and performance.

✅ Summary

  • The Master records every metadata change in an append-only operation log
  • This log is replicated across machines for durability
  • Periodic checkpoints allow the Master to avoid replaying the entire log on startup
  • The recovery process is simple, fast, and consistent
  • This model avoids the need for heavyweight transactional databases while still ensuring reliability at scale

🧹 Garbage Collection in GFS: Lazy, Safe, and Efficient

Deleting a file might seem like a straightforward operation, but in a distributed file system like GFS, deletion involves multiple layers:

  • Removing metadata from the Master
  • Deleting chunk data from all chunkservers
  • Avoiding accidental data loss
  • Keeping the system consistent and efficient

GFS solves this using a lazy garbage collection strategy.

🐢 Why Lazy Deletion?

In GFS, when a client deletes a file, the deletion doesn’t happen immediately. Instead, the file is:

  • Marked for deletion in the Master’s metadata
  • Left untouched on chunkservers for a short, configurable grace period (typically 3 days)

Why delay the deletion?

✅ Reasons

  • Accidental deletions can be undone within the grace period
  • Immediate cleanup across thousands of chunkservers is expensive and slow
  • Simplifies coordination by batching deletions
  • Allows the system to clean up during low-load periods

🔄 Deletion Workflow: Step-by-Step

Let’s walk through how file deletion actually works:

  1. Client initiates a delete request → the Master logs the operation in the operation log (so it’s durable)
  2. Master marks the file as deleted, but:
     • The metadata is not immediately removed
     • Chunkservers are not immediately contacted
  3. Lazy cleanup begins later, in two phases:

Phase 1: Metadata cleanup

During periodic scans (e.g. every few hours), the Master checks for files marked as deleted more than 3 days ago, and removes them from its namespace and chunk mapping.

Phase 2: Data cleanup on chunkservers

During regular heartbeat communication, the Master sends a list of orphaned chunks to each chunkserver, asking them to delete those chunks.
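The two phases above can be sketched as follows (the data shapes and the day-based clock are invented for illustration):

```python
GRACE_PERIOD_DAYS = 3

def metadata_cleanup(files, now):
    """Phase 1: during a periodic scan, drop files marked deleted more
    than GRACE_PERIOD_DAYS ago; return their now-orphaned chunks."""
    orphaned = []
    for path, info in list(files.items()):
        deleted_at = info.get("deleted_at")
        if deleted_at is not None and now - deleted_at > GRACE_PERIOD_DAYS:
            orphaned.extend(info["chunks"])
            del files[path]
    return orphaned

def heartbeat_reply(orphaned, server_chunks):
    """Phase 2: piggyback on the heartbeat. The reply tells the
    chunkserver which of its chunks are orphaned and can be deleted."""
    orphan_set = set(orphaned)
    return [c for c in server_chunks if c in orphan_set]

files = {
    "/tmp/old": {"chunks": ["c1", "c2"], "deleted_at": 0},  # marked day 0
    "/tmp/new": {"chunks": ["c3"], "deleted_at": 4},        # marked day 4
}
orphaned = metadata_cleanup(files, now=5)      # day 5: only /tmp/old expired
to_delete = heartbeat_reply(orphaned, ["c2", "c3"])
# orphaned == ["c1", "c2"]; this chunkserver is told to drop only "c2"
```

Within the grace period, undoing a deletion is just clearing the mark; nothing has been physically removed yet.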

🧠 Analogy: Book Removal from a Library

Imagine a librarian who gets a request to remove outdated books from the catalog.

  • She first tags the books as “to be removed” in her log
  • But she doesn’t immediately take them off the shelves
  • Instead, during her weekly check, she identifies books marked for removal more than 3 days ago
  • Then she tells the assistants (chunkservers) to remove them during routine rounds

This way, any book mistakenly marked can be recovered within a grace period, and the library doesn’t waste effort handling removals one by one.

⚖️ Trade-Offs of Lazy Deletion

Immediate deletion

  • ❌ Not done
  • Reason: Would require complex coordination across the system

Delayed deletion (e.g., 3-day wait)

  • ✅ Used
  • Benefit: Allows recovery of mistakenly deleted files and simplifies the deletion workflow

Per-request cleanup

  • ❌ Avoided
  • Benefit: Prevents sudden load spikes on the system

Batched cleanup via heartbeat

  • ✅ Used
  • Benefit: Reuses existing communication (heartbeats), reducing overhead

💥 What If…?

What if a chunkserver misses the delete command?

It will get the updated list in the next heartbeat. GFS tolerates temporary inconsistency. What if the Master crashes after marking a file for deletion?

The deletion is already in the operation log. The new Master will replay the log and continue from where it left off. Can the client undelete a file?

Not directly. But since the chunks still exist during the grace period, restoring the metadata (via admin tooling or logging) can make it accessible again.

✅ Summary

  • File deletion in GFS is delayed and batched, not immediate
  • Deletions are first logged for durability, then cleaned up lazily
  • Chunk deletion happens via heartbeats, making the process scalable
  • This approach minimises coordination overhead and allows safe recovery from accidental deletions

⚖️ Rebalancing: Keeping GFS Even and Healthy

In a distributed file system like GFS, chunkservers can vary in:

  • Available disk space
  • Network bandwidth
  • Hardware health
  • Rack or datacenter location

If chunk placement is not managed continuously, some chunkservers may become overloaded while others sit idle. To avoid this, GFS includes a rebalancing process led by the Master.

🎯 Why Is Rebalancing Needed?

Although the Master places chunks intelligently when files are first created, the system evolves over time:

  • Some chunkservers fill up faster
  • Hardware may fail and come back online
  • Clients write to some files more heavily than others
  • Some racks may get more chunk replicas than others

To maintain:

  • Uniform disk usage
  • Balanced read/write load
  • Fault tolerance across racks

…the Master must periodically rebalance the system.

🔄 How Rebalancing Works

The Master doesn’t rebalance aggressively — instead, it scans the cluster periodically to identify imbalance, and only then takes action.

Here’s how it works:

Scan for Imbalance

  • Checks whether some chunkservers are over-utilised while others are under-utilised
  • Looks for chunks that don’t follow replica placement policies (e.g., all replicas on one rack)

Select Candidate Chunks

  • Target replicas that can be safely moved without impacting availability
  • Ensures replicas are not being read/written during the move

Copy to New Location

  • Creates a replica on a better-suited chunkserver
  • Once the new replica is healthy, the Master deletes the old one

Update Metadata

The Master updates its in-memory and persistent metadata to reflect the new location
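The scan-and-select step can be sketched as a toy heuristic (the usage metric and threshold are invented for illustration; real GFS weighs disk usage, placement policy, and recent activity):

```python
def pick_rebalance_move(usage, threshold=0.2):
    """Scan chunkserver disk usage (fraction full) and, if the spread
    exceeds a threshold, propose moving one replica from the most-loaded
    server to the least-loaded one."""
    most = max(usage, key=usage.get)
    least = min(usage, key=usage.get)
    if usage[most] - usage[least] > threshold:
        return (most, least)   # (source, destination)
    return None                # cluster is balanced enough

usage = {"cs1": 0.9, "cs2": 0.5, "cs3": 0.4}
move = pick_rebalance_move(usage)
# Proposes moving a replica from cs1 (90% full) to cs3 (40% full)
```

Executing the move follows the copy-then-delete order from the text: the new replica is created and validated first, and only then is the old one removed.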

🧠 Analogy: Reorganising Library Shelves

Imagine a library where some shelves are packed and others are nearly empty. Over time, certain books are read more than others, some sections become outdated, and some shelves get relocated to a better-lit corner.

Every few days, the librarian rearranges books: moving them from crowded shelves to emptier ones, ensuring popular books are placed in more accessible spots, and keeping backup copies of important books in different rooms.

This keeps the library organised, accessible, and resilient to any section becoming unusable.

🧠 Placement Strategy: More Than Just Balance

The Master also considers fault tolerance and availability when rebalancing:

  • Tries to place chunk replicas on different racks (rack-aware placement)
  • Avoids placing multiple replicas on chunkservers with recent issues
  • May avoid chunkservers with high network latency or pending errors

This means rebalancing isn’t just about equal space, it’s about smart space.

⚖️ Trade-Offs of Rebalancing

CPU / bandwidth cost

  • ✅ GFS uses passive, periodic scanning
  • Benefit: Avoids impacting critical paths or live client operations

Replica consistency

  • ✅ GFS rebalances one replica at a time
  • Benefit: Ensures no data loss and safe transitions

Risk of stale replicas

  • ✅ GFS validates the new replica before deleting the old one
  • Benefit: Prevents serving corrupt or partial data

Rack-awareness

  • ✅ Enforced during initial placement and rebalancing
  • Benefit: Ensures fault tolerance even in the case of a full rack failure

💥 What If…?

What if a chunk is being read while being moved?

GFS creates the new replica in parallel. Only after validation is the old one removed.

What if a chunkserver becomes full before the Master rebalances?

Writes to that chunkserver are rejected. The Master stops assigning new chunks there and moves data off it during the next cycle.

What if rebalancing creates temporary duplication?

That’s acceptable. GFS tolerates temporary redundancy to ensure safety and consistency.

✅ Summary

  • GFS periodically rebalances chunks to ensure uniform disk usage and load distribution
  • Rebalancing improves performance, avoids hot spots, and strengthens fault tolerance
  • Chunk movement is done carefully, ensuring data integrity and system availability
  • Rack-aware placement ensures GFS can tolerate entire rack failures without data loss

✅ Checksums: The Silent Guardian of Data Integrity

In a distributed system like GFS, hardware failures are expected, not exceptional.

Disks can silently corrupt data, memory flips can occur, or a faulty disk controller might return incorrect blocks. When working with commodity machines, these risks become even more pronounced.

GFS tackles this with a robust checksum mechanism that ensures the integrity of every chunk it serves.

🔎 What Is a Checksum?

A checksum is a small, fixed-size hash value calculated from a block of data. If the data changes even slightly (e.g., a single bit flip), the checksum changes, allowing the system to detect corruption.

🧱 How GFS Uses Checksums

Each chunk in GFS is divided into 64KB blocks, and for each block, a checksum is calculated and stored.

Where is it stored?

  • The checksum is persisted alongside the chunk on the chunkserver’s local disk
  • It may also be cached in memory to avoid recalculating during frequent reads

🔁 Chunk Read Flow with Checksums

Here’s what happens when a chunk is read:

  • The chunkserver loads the requested 64KB block from disk

  • It computes a fresh checksum for that block

  • It compares the freshly computed checksum with the stored checksum

  • If they match → ✅ the data is served to the client

  • If they don’t match → ❌ corruption is detected

In case of a mismatch:

  • The chunkserver refuses to serve the corrupt data

  • It notifies the Master about the corruption

  • The Master then clones a new replica from one of the healthy chunkservers

  • The corrupted replica is eventually deleted
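The per-block validation in the read flow can be sketched like this (CRC32 is used here for illustration; the paper does not mandate a specific checksum function):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64KB blocks within a chunk

def block_checksums(chunk_data):
    """Compute one checksum per 64KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def read_block(chunk_data, stored_checksums, block_index):
    """Recompute the requested block's checksum, compare it with the
    stored value, and refuse to serve the block on a mismatch."""
    start = block_index * BLOCK_SIZE
    block = chunk_data[start:start + BLOCK_SIZE]
    if zlib.crc32(block) != stored_checksums[block_index]:
        raise IOError(f"corruption detected in block {block_index}")
    return block

data = bytes(200 * 1024)                 # a 200KB chunk -> 4 blocks
checksums = block_checksums(data)
ok = read_block(data, checksums, 1)      # validates and serves cleanly

corrupted = bytearray(data)
corrupted[70 * 1024] ^= 0xFF             # flip one bit inside block 1
# read_block(bytes(corrupted), checksums, 1) now raises IOError,
# while block 0 still validates: only the damaged block is affected
```

This also previews the next section: because checksums are per 64KB block, a single bit flip condemns only one small block, not the whole 64MB chunk.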

🧠 Why Split into 64KB Blocks?

Instead of using one checksum for the full 64MB chunk, GFS uses small 64KB sub-blocks. Why?

Early corruption detection

Even small bit flips are caught with high precision

Efficient reads

Only the affected block (64KB) needs to be re-validated if corruption is suspected

Parallelism

Multiple blocks can be validated in parallel, improving performance

Partial recovery

A corrupted 64KB block doesn’t invalidate the entire 64MB chunk — only that specific block is affected

This approach is particularly powerful in workloads like streaming, logging, and large analytics, where read performance and data integrity are both critical.

🧠 Analogy: Book Pages with Watermarks

Imagine each chunk as a book and each 64KB block as a page.

GFS applies a unique watermark (checksum) on each page.

When someone requests a book:

  • The system quickly checks every page for watermark validity
  • If any page is missing or altered, that copy of the book is rejected
  • A clean copy is fetched from another bookshelf

This protects the reader from reading corrupted or misleading information, just like checksums protect clients from bad data.

⚖️ Trade-Offs of Checksum-Based Integrity

Performance vs Protection

  • ✅ GFS validates every read block
  • Benefit: Slight CPU overhead in exchange for strong data integrity

Memory usage vs Speed

  • ✅ GFS caches checksums in memory
  • Benefit: Reduces repeated disk reads and speeds up validation

Chunk size vs Granularity

  • ✅ GFS uses 64KB blocks
  • Benefit: Balances fine-grained corruption detection with minimal overhead

💥 What If…?

What if all replicas of a chunk are corrupted?

Rare, but possible. The Master marks the chunk as permanently lost, alerts are raised, and GFS treats the file as partially unreadable.

Can the client detect corruption itself?

Not directly. GFS ensures that chunkservers perform checksum validation before responding to client requests.

Does this add latency to reads?

Slightly, yes. But the cost is low, and checksum validation ensures correctness over speed, which is critical for trusted storage.

✅ Summary

  • GFS splits each 64MB chunk into 64KB blocks and stores checksums for each block
  • During reads, chunkservers validate block integrity before serving
  • Corrupted data is not served; the Master repairs it using healthy replicas
  • This mechanism is lightweight, resilient, and critical for working with unreliable hardware
  • GFS prefers silent self-healing and data trust over blind performance

🧠 Conclusion: What GFS Taught Us

Throughout this series, we’ve explored the why, how, and what of the Google File System — a system that fundamentally redefined how large-scale storage should work in a world where hardware failures are normal and data is massive.

Let’s quickly recap what made GFS revolutionary:

🔑 Key Takeaways

  • Scalability through simplicity: GFS avoids complexity where possible — single Master for metadata, chunked file storage, and append-friendly writes.
  • Failure is expected, not exceptional: Everything in GFS — replication, checksums, heartbeats — is designed assuming systems will fail.
  • Metadata is powerfully managed: The Master doesn’t need to know everything all the time — only what’s essential and easy to reconstruct.
  • Optimised for real workloads: GFS is not about elegance in theory — it’s about practical design decisions for logs, streams, and large files.

🎬 Outro: End of One Series, Start of Another

GFS may have been built for Google’s internal needs, but its ideas have shaped distributed storage systems across the world — from Hadoop to Colossus, and beyond.

We hope this series helped you understand not just the architecture, but the reasoning behind it — the trade-offs, the design choices, and the operational mindset.

If you learned something new, found a fresh perspective, or have a question — feel free to drop a comment or message.

🚀 What’s Next?

We’ll continue exploring the foundational systems that power the modern web — breaking them down into stories that are simple, insightful, and practical.

Stay tuned, and see you in the next series!

References

Ghemawat, Gobioff, and Leung, “The Google File System” (SOSP 2003): https://storage.googleapis.com/gweb-research2023-media/pubtools/4446.pdf
