Friday, 17 October 2014

Hadoop Graphing with Cacti

What is Cacti?

Cacti is an RRD front end. You can learn more about it on the Cacti website.
Cacti differs from Ganglia in that Cacti polls for data using SNMP or shell scripts, while applications push data to Ganglia. The two overlap in features, but for those with a large Cacti deployment, installing a second statistics system just for Hadoop may not be an option.
I have had great success over the years graphing everything from user CPU and NetApp disk reads to environmental sensors with Cacti. When I saw the information in Hadoop JMX, I started working on a set of Hadoop templates, hadoop-cacti-jtg. My goal was to provide a visual representation of all pertinent Hadoop JMX information.
Administrators and developers can use these templates to better manage Hadoop and understand how it is working behind the scenes. Currently, the package has several predefined graphs covering the Hadoop NameNode and DataNode. Let’s walk through some of them.
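Since Cacti can poll through scripts, it may help to see what such a script hands back. A Cacti script data source with several outputs expects one line of space-separated name:value pairs. The sketch below only illustrates that output format; it is not the code used by hadoop-cacti-jtg. The helper names and the sample attribute names and values are assumptions made up for the example.

```python
def pick_metrics(bean, wanted):
    """Keep only the attributes we want to graph (missing ones are skipped)."""
    return {name: bean[name] for name in wanted if name in bean}


def format_for_cacti(metrics):
    """Render metrics as one line of space-separated name:value pairs."""
    return " ".join(f"{name}:{metrics[name]}" for name in sorted(metrics))


# Hypothetical attributes, standing in for values read from Hadoop JMX.
sample_bean = {
    "CapacityTotal": 1000,
    "CapacityUsed": 400,
    "CapacityRemaining": 500,
    "FilesTotal": 42,
}

wanted = ["CapacityTotal", "CapacityUsed", "CapacityRemaining"]
line = format_for_cacti(pick_metrics(sample_bean, wanted))
print(line)
```

Each name in the output line would then be mapped to an output field of the data input method in Cacti.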
Hadoop Capacity
Hadoop Capacity provides the same type of information you get from monitoring a standard disk. The top black line represents the maximum capacity. This is all the possible storage on all currently active DataNodes.
You also have the used and free capacity information stacked on top of each other. You can use these values to trend your file system growth. In most cases your file system should be growing steadily, assuming you have batch processes running on a schedule. You may want to set a Cacti Threshold alarm at 80% of capacity. If the alarm goes off, it’s good practice to clean up unused files, or you can take the lazy way and order more DataNodes :)
[Figure: Hadoop NameNode capacity graph (hadoop_name_cap.png)]
If you are wondering why used plus free does not add up to capacity, remember that Hadoop keeps a reserve on each DataNode. Your disk file system might also hold back a reserve of its own. If a disk is devoted solely to serving HDFS, you can tune the file system reserve down with the following command:
tunefs -m <percent>
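The arithmetic behind the capacity graph can be sketched in a few lines. This is a minimal illustration, not part of the templates: the function name, the sample numbers, and the idea of reporting the gap between capacity and used plus free as "unaccounted" are assumptions for the example. The 80% default mirrors the alarm level suggested above.

```python
def capacity_report(capacity, used, free, alarm_percent=80.0):
    """Summarize capacity figures, all given in the same unit (e.g. bytes)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    used_percent = 100.0 * used / capacity
    return {
        "used_percent": used_percent,
        # Space that is neither used nor free, e.g. held back as a reserve.
        "unaccounted": capacity - used - free,
        "alarm": used_percent >= alarm_percent,
    }


# Hypothetical values for illustration only.
report = capacity_report(capacity=1000, used=820, free=130)
print(report)
```

In a real deployment the THold plugin would do the alarming; the sketch only shows what the threshold is measuring.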
Live vs. Dead Nodes
The Hadoop live and dead node information is available on the NameNode’s web interface. This stack-style graph shows both values together. The blue area represents the number of live DataNodes, while the red area shows the number of dead DataNodes. If you are using the Cacti Threshold system, you can use it to set off a warning if dead DataNodes exceed 20% of the cluster.
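As a rough sketch of the check such a warning performs, the function below compares the share of dead DataNodes against a limit. The function name and sample counts are hypothetical, and treating an empty cluster as "no alarm" is an assumption made for the example.

```python
def dead_node_alarm(live, dead, max_dead_percent=20.0):
    """Return True when dead DataNodes exceed the given share of all nodes."""
    total = live + dead
    if total == 0:
        # No DataNodes reported at all; nothing to compare against.
        return False
    return 100.0 * dead / total > max_dead_percent


print(dead_node_alarm(live=8, dead=2))  # exactly 20% dead, not above the limit
print(dead_node_alarm(live=7, dead=3))  # 30% dead
```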

NameNode Stats
Hadoop JMX gives us a breakdown of file operations by type. This graph details the requests the NameNode is responding to. I ran several teragens and terasorts from the examples.jar. Below, we can see the process both creating and reading files from the system as the MapReduce jobs run.

DataNode Blocks
The DataNode statistics are similar to the NameNode statistics. This graph template can be applied to each DataNode, allowing you to track BlocksRead, BlocksWritten, BlocksRemoved, and BlocksReplicated. You can use this to find “hot spots” in your data. A hot spot is a piece of data that is frequently accessed. Increasing the replication of those files helps by spreading the access across other DataNodes.
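One simple way to spot candidates is to compare each DataNode's BlocksRead against the cluster average. The sketch below is only one possible heuristic, not something the templates do: the function name, the factor of 2, and the sample counts are all assumptions for illustration.

```python
from statistics import mean


def find_hot_datanodes(blocks_read, factor=2.0):
    """Return the nodes whose BlocksRead count exceeds factor times the average."""
    if not blocks_read:
        return []
    average = mean(blocks_read.values())
    return sorted(
        node for node, count in blocks_read.items() if count > factor * average
    )


# Hypothetical BlocksRead counts per DataNode over the same time window.
sample = {"dn1": 100, "dn2": 120, "dn3": 900, "dn4": 80}
hot = find_hot_datanodes(sample)
print(hot)
```

A node that stands out this way points at where the hot data lives; you would still need to identify the files involved before raising their replication, for example with hadoop fs -setrep.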

 

Cacti Extras

Cacti offers many excellent out-of-the-box features. The following add-on features are helpful for monitoring Hadoop deployments. You can find these on the Cacti site:
  • Linux Full CPU Graph – Adds IOWait and other kernel states. The default CPU graph only shows nice, user, and system.
  • Linux Full Memory Graph – The standard memory graph does not show swap usage.
  • Disk Utilization Graph – You can graph bytes written to physical devices from SNMP. This is helpful for underlying disk utilization and maximum possible disk performance.
  • RealTime Plugin – Graphs data at 5-second intervals. By default, Cacti polls at 1-minute or 5-minute intervals. The faster polling does not help much for Hadoop, since the JMX values are probably only updating at 5-minute intervals, but it is generally useful for real-time reporting of other SNMP information.
  • THold Plugin – The Threshold plugin creates some overlap between Nagios and Cacti, and sends alarms when data exceeds high or low values.
  • Aggregate Plugin – The aggregate plugin is ideal for graphing clusters into a single graph. You may want to graph the “Open File Count” across several nodes – this plugin makes the graphing process fast and easy.

Where to go from Here

If you want to see the Hadoop Cacti templates in action, check out the Live Sample (user: hadoop, password: hadoop). To get started, simply follow the Installation Instructions. The project is released under the Apache License, Version 2.0. You can view the Source Repository. A Hudson system provides the latest build if you want to dig into the project source code.