Hadoop Code - Deep Dive Series - Part 2 - High Availability

Before I continue the deep dive series, I want to thank the Apache Hadoop community for making sure the following key aspects are taken care of:

  • The code is commented well
  • Test cases are written to cover the features
  • Coding standards are enforced with Checkstyle

Code becomes its own documentation and speaks for itself when the above things are taken care of. The code stays maintainable, testable and clean. Apart from these, JIRA and automated patch validation using Jenkins/Hudson make the review process much easier.

High Availability (HA) is a topic many of us are curious about. While building a distributed system, it's not just scalability that matters; making sure the system is highly available is just as important.

A system that is highly scalable but not highly available does not let the technology/engineering team offer SLAs to business users. Forget about engineers and business users: imagine a service or product that lets you upload virtually unlimited photos but does not guarantee that the photos will be available when you want to view them. Would you want to use that product or service?

No

Continuing from part 1 of this series, this time I am picking the FailoverController class.

As per its Java documentation:

"FailoverController is responsible for electing an active service on startup or when the current active is changing (eg due to failure), monitoring the health of a service, and performing a fail-over when a new active service is either manually selected by a user or elected."

This class has methods that are useful when developing a new HA service, such as:

  • Pre-failover checks: During failover, is the standby server ready to become active? Perform checks, validate, and perform a sync operation.

    Example: copy the fsimage and replay the edits from the shared file server to find out what changed after the fsimage was created.
  • Graceful fencing: Make sure the active server is stopped. Want to know why? Read up on STONITH (Shoot The Other Node In The Head) and fencing.
  • Force failover option: Even if the server that the service is transitioning to is not ready to become active, try a forced failover.
  • Connect over proxy: Connect to the service via a proxy (maybe for firewall or load balancing reasons?).
  • Configuration check: Is the target server, to which the service is trying to fail over, configured correctly? There is no point in failing over a service to a server process that is not configured properly.
  • Actual failover implementation: Make sure the active process is fenced. The pre-failover check makes sure the state information has been propagated to the standby. Perform the switchover to the standby. If the switchover does not succeed, revert to the old active. Of course, the target is fenced first in that case to avoid the split-brain problem.
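To make the sequence in the last bullet concrete, here is a minimal sketch of the flow in Java. It is not Hadoop's code: Node, State, hardFence and failover are hypothetical names made up for illustration, the "services" are plain in-memory objects, and hard fencing is assumed to always succeed.

```java
/** Illustrative sketch only; none of these types are Hadoop's actual API. */
final class FailoverSketch {

    enum State { ACTIVE, STANDBY, FENCED }

    /** Hypothetical in-memory stand-in for an HA service. */
    static final class Node {
        final String name;
        State state;
        boolean ready = true;       // would it pass the pre-failover checks?
        boolean responsive = true;  // does it answer transition requests?

        Node(String name, State state) {
            this.name = name;
            this.state = state;
        }

        void transitionTo(State target) {
            if (!responsive) {
                throw new IllegalStateException(name + " is not responding");
            }
            state = target;
        }
    }

    static final class FailoverFailedException extends Exception {
        FailoverFailedException(String message) {
            super(message);
        }
    }

    /** Hard fencing stand-in (think STONITH); assumed to always succeed here. */
    static void hardFence(Node node) {
        node.state = State.FENCED;
    }

    static void failover(Node from, Node to, boolean forceActive)
            throws FailoverFailedException {
        // 1. Pre-failover check, skipped when the caller forces the failover.
        if (!forceActive && !to.ready) {
            throw new FailoverFailedException(to.name + " is not ready to become active");
        }
        // 2. Graceful fencing: ask the active to step down; hard-fence it if it cannot.
        boolean hardFenced = false;
        try {
            from.transitionTo(State.STANDBY);
        } catch (IllegalStateException e) {
            hardFence(from);
            hardFenced = true;
        }
        // 3. Switch over to the target.
        try {
            to.transitionTo(State.ACTIVE);
        } catch (IllegalStateException e) {
            // 4. Fence the target first to avoid split-brain, then fail back
            //    (only possible if the old active was not hard-fenced).
            hardFence(to);
            if (!hardFenced) {
                from.transitionTo(State.ACTIVE);
            }
            throw new FailoverFailedException("Unable to fail over to " + to.name);
        }
    }

    public static void main(String[] args) throws Exception {
        Node nn1 = new Node("nn1", State.ACTIVE);
        Node nn2 = new Node("nn2", State.STANDBY);
        failover(nn1, nn2, false);
        System.out.println(nn1.name + "=" + nn1.state + ", " + nn2.name + "=" + nn2.state);
    }
}
```

The point of the sketch is the ordering: the old active is fenced before the target is activated, and on a failed switchover the target is fenced before anything is reverted, so there is never a moment with two actives.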

The pre-failover check and the configuration check are very important when switching to the new server. They make sure the server is ready to accept requests before clients start communicating with the new active.
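As a rough illustration of what such a check might look at, here is a small Java sketch. TargetStatus and check are hypothetical names invented for this post, and the four flags are my assumption of what a target would report; the real class talks to the target service rather than reading a ready-made snapshot.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not Hadoop's actual API. */
final class PreFailoverCheckSketch {

    /** Hypothetical snapshot of what the failover target reports about itself. */
    static final class TargetStatus {
        final boolean reachable;
        final boolean alreadyActive;
        final boolean healthy;
        final boolean readyToBecomeActive;

        TargetStatus(boolean reachable, boolean alreadyActive,
                     boolean healthy, boolean readyToBecomeActive) {
            this.reachable = reachable;
            this.alreadyActive = alreadyActive;
            this.healthy = healthy;
            this.readyToBecomeActive = readyToBecomeActive;
        }
    }

    /** Returns the reasons the failover should not proceed; empty means "go ahead". */
    static List<String> check(TargetStatus target, boolean forceActive) {
        List<String> problems = new ArrayList<>();
        if (!target.reachable) {
            // Nothing else can be verified if the target cannot be contacted.
            problems.add("target is unreachable");
            return problems;
        }
        if (target.alreadyActive) {
            problems.add("target is already active");
        }
        if (!target.healthy) {
            problems.add("target failed its health check");
        }
        // A forced failover skips only the readiness check, not the others.
        if (!target.readyToBecomeActive && !forceActive) {
            problems.add("target is not ready to become active");
        }
        return problems;
    }

    public static void main(String[] args) {
        TargetStatus lagging = new TargetStatus(true, false, true, false);
        System.out.println("normal: " + check(lagging, false));
        System.out.println("forced: " + check(lagging, true));
    }
}
```

Returning a list of reasons instead of a boolean is just a design choice for the sketch; it makes it easy to tell the operator why a failover was refused.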

A few other specifics, such as keeping latency low during fencing, are also taken care of. It should not happen that the active is down, the FailoverController is busy trying to fence it (it may be unresponsive or unreachable), and the failover to the standby is delayed by timeouts and retries.
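One generic way to keep that delay bounded is to put a hard time limit on the graceful attempt and move on to hard fencing when it expires. The sketch below shows the idea with plain java.util.concurrent; tryGracefulFence and its parameters are hypothetical names for this post, and Hadoop's own mechanism may well differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative sketch only; not Hadoop's actual API. */
final class BoundedFenceSketch {

    /**
     * Runs the (hypothetical) "please become standby" call with a time limit.
     * Returns true only if the call completed normally within the limit.
     */
    static boolean tryGracefulFence(Callable<Void> transitionToStandby, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "graceful-fence");
            t.setDaemon(true); // never keep the JVM alive for a hung call
            return t;
        });
        Future<Void> call = executor.submit(transitionToStandby);
        try {
            call.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            call.cancel(true); // give up; the caller should hard-fence instead
            return false;
        } catch (ExecutionException e) {
            return false;      // the active answered with an error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            call.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulates an active that hangs for 10 s while we allow only 200 ms.
        boolean fenced = tryGracefulFence(() -> {
            Thread.sleep(10_000);
            return null;
        }, 200);
        System.out.println("graceful fence succeeded: " + fenced);
    }
}
```

With a limit like this, an unreachable active costs the failover at most the configured timeout before the controller falls back to a harder fencing method.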

In the next part of this series I will cover the ZooKeeper Failover Controller. It will help to understand ZooKeeper's awesome features and the pretty solid way HA is achieved with them.