To better understand Hadoop, it always helps to know the internals. With that motto, I started exploring the Hadoop code base (the 2.7.3 branch), beginning with the hadoop-common module in the hadoop-main parent project.

Disclaimer: This blog post is not for Hadoop beginners. It is for intermediate-to-advanced readers who would like to know how things work inside Hadoop.

hadoop-common contains the shared code base used across all the Hadoop frameworks, such as YARN, HDFS, and MapReduce. It also includes reusable helper code along with the common security-related code.

hadoop-common has the following packages. These are high-level descriptions; we will dig into each of them in detail in future blog posts.

  • conf: All about Configuration, which is used to pass configuration across the frameworks (a short sketch follows this list).
  • crypto: All about key management, password storage in the Java KeyStore, etc.
  • fs: File system related classes (see the FileSystem sketch after this list).
  • ha: High availability common code.
  • http: This package has the embedded Jetty HTTP server implementation. Its primary goal is to serve up status information for the server.
  • io: This package has Hadoop primitives such as Writables, along with the compression and serialization implementations (a custom Writable is sketched after this list).
  • ipc: This package has the common classes used in inter-process communication, including the RPC implementations (see the RPC sketch after this list).
  • jmx: Provides read-only access to JMX metrics over HTTP, with an optional qry (query) parameter for selecting specific MBeans.
  • log: Provides the ability to change log levels at run time, plus an EventCounter.
  • metrics: The old metrics package, which contains classes for reporting Hadoop metrics.
  • metrics2: The new way of collecting metrics in Hadoop. HADOOP-6728 has more details, including the design.
  • net: This has the Unix domain socket implementation, DNS, network topology, and some wrapper classes for working with sockets.
  • record: Deprecated. This has been replaced by Apache Avro.
  • service: All Hadoop services implement classes in this package. It has utility classes that deal with service states and service state change listeners (a minimal service is sketched after this list).
  • tools: Has a few tools for getting user information, plus table-listing (with headers) classes.
  • tracing: Classes in this package provide utility functions for tracing and a command-line tool for viewing and modifying tracing settings. They also provide functions for reading the names of SpanReceivers from the Hadoop configuration, adding those SpanReceivers to the Tracer, and closing those SpanReceivers when appropriate.
  • util: This package is loaded with a ton of classes. A few major ones: Bloom filter implementations; hash functions such as Murmur hash and Jenkins hash; a disk check utility; sort implementations such as heapsort, quicksort, and mergesort; and CRC implementations. The ToolRunner is also hosted in this package (a sketch follows this list).
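
Here is a minimal sketch of how the conf package is typically used. A Configuration object loads core-default.xml and core-site.xml from the classpath, and values set in code override those XML resources. The class name ConfDemo and the key my.app.name are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Reads can supply a default as a fallback.
        String fsUri = conf.get("fs.defaultFS", "file:///");
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // Values set programmatically override the XML resources.
        conf.set("my.app.name", "conf-demo"); // "my.app.name" is a made-up key

        System.out.println(fsUri + " / " + bufferSize + " / " + conf.get("my.app.name"));
    }
}
```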
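
The fs package is usually driven through the abstract FileSystem class. A minimal sketch, assuming the default file:/// local file system and a throwaway path of my choosing:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns the implementation named by fs.defaultFS
        // (LocalFileSystem here, unless core-site.xml points at HDFS).
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/fs-demo.txt"); // throwaway path for illustration
        try (FSDataOutputStream out = fs.create(path, true)) { // true = overwrite
            out.writeUTF("hello from the fs package");
        }
        System.out.println("exists: " + fs.exists(path)
            + ", size: " + fs.getFileStatus(path).getLen());
    }
}
```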
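
Writables are Hadoop's own serialization primitives from the io package: a class implements write and readFields, and both must walk the fields in exactly the same order. A hypothetical PointWritable as a sketch:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom Writable serializes its fields in a fixed order;
// readFields must read them back in exactly the same order.
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() {} // Writables need a no-arg constructor for reflection

    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}
```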
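
Hadoop RPC in the ipc package builds a server around a plain Java interface that carries a versionID field, and clients call it through a proxy. The sketch below is hypothetical (EchoProtocol, the port, and the address are all made up) and assumes the WritableRpcEngine wiring of the 2.x line:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.WritableRpcEngine;

public class RpcDemo {
    // A protocol is just an interface with a versionID field.
    public interface EchoProtocol { // hypothetical protocol for illustration
        long versionID = 1L;
        String echo(String message) throws IOException;
    }

    public static class EchoServer implements EchoProtocol {
        @Override
        public String echo(String message) { return message; }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the engine explicitly for clarity.
        RPC.setProtocolEngine(conf, EchoProtocol.class, WritableRpcEngine.class);

        RPC.Server server = new RPC.Builder(conf)
            .setProtocol(EchoProtocol.class)
            .setInstance(new EchoServer())
            .setBindAddress("127.0.0.1")
            .setPort(12345)
            .build();
        server.start();

        EchoProtocol proxy = RPC.getProxy(EchoProtocol.class, EchoProtocol.versionID,
            new InetSocketAddress("127.0.0.1", 12345), conf);
        System.out.println(proxy.echo("hello over Hadoop RPC"));

        RPC.stopProxy(proxy);
        server.stop();
    }
}
```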
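
Classes in the service package follow the lifecycle NOTINITED, INITED, STARTED, STOPPED, with AbstractService enforcing the transitions while subclasses override the hooks. A minimal sketch with a made-up DemoService:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class DemoService extends AbstractService { // hypothetical service

    public DemoService() {
        super("DemoService"); // the service name reported in state changes
    }

    @Override
    protected void serviceInit(Configuration conf) throws Exception {
        System.out.println("serviceInit");
        super.serviceInit(conf);
    }

    @Override
    protected void serviceStart() throws Exception {
        System.out.println("serviceStart");
        super.serviceStart();
    }

    @Override
    protected void serviceStop() throws Exception {
        System.out.println("serviceStop");
        super.serviceStop();
    }

    public static void main(String[] args) {
        DemoService svc = new DemoService();
        svc.init(new Configuration()); // NOTINITED -> INITED
        svc.start();                   // INITED -> STARTED
        svc.stop();                    // STARTED -> STOPPED
        System.out.println("final state: " + svc.getServiceState());
    }
}
```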
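
ToolRunner from the util package is the usual entry point for Hadoop command-line programs: it parses the generic options (-D, -conf, -fs, and so on) into the Configuration before handing the remaining arguments to your Tool. A sketch with a made-up EchoTool:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class EchoTool extends Configured implements Tool { // hypothetical tool

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // already populated by ToolRunner
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
        for (String arg : args) { // args are left over after generic options
            System.out.println(arg);
        }
        return 0; // becomes the process exit code
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new EchoTool(), args);
        System.exit(exitCode);
    }
}
```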

Apart from the Java classes, there is native code written in C under the native directory.

This is just high-level information about the packages. I will add links to the points above as the corresponding blog posts are ready.

Feel free to suggest changes or edits to make this blog more helpful. Also, comment on which packages you are most interested in; I can try to write about a specific topic if many are interested.

Start commenting, and vote up good questions in the Disqus comments below.