The discussion about Hadoop and Cassandra Summits in the San Francisco Bay Area. It was rewarding to talk to so many experienced Big Data technologists in such a short time frame – thanks to our partners DataStax and Hortonworks for hosting these great events! It was also great to see that performance is becoming an important topic in the community at large. We got a lot of feedback on typical Big Data performance issues and were surprised by the performance related challenges that were discussed. The practitioners here were definitely no novices, and the usual high level generic patterns and basic cluster monitoring approaches were not on the hot list. Instead we found more advanced problem patterns – for both Hadoop and Cassandra.
I’ve compiled a list of the most interesting and most common issues for Hadoop and Cassandra deployments:
Top Hadoop Issues
Map Reduce data locality
Data locality is one of the key advantage of Hadoop Map/Reduce; the fact that the map code is executed on the same data node where the data resides. Interestingly many people found that this is not always the case in practice. Some of the reasons they stated were:
- Speculative execution
- Heterogeneous clusters
- Data distribution and placement
- Data Layout and Input Splitter
The challenge becomes more prevalent in larger clusters, meaning the more data nodes and data I have the less locality I get. Larger clusters tend not to be complete homogeneous, some nodes are newer and faster then others, bringing the data to compute ratio out of balance. Speculative execution will attempt to use compute power even though the data might not be local. The nodes that contain the data in question might be computing something else, leading to another node doing non-local processing.The root cause might also lie in the data layout/placement and the used Input Splitter. Whatever the reason non-local data processing puts a strain on the network which poses a problem to scalability. The network becomes the bottleneck. Additionally, the problem is hard to diagnose because it is not easy to see the data locality.
To improve data locality, you need to first detect which of your jobs have a data locality problem or degrade over time. With APM solutions you can capture which tasks access which data nodes. Solving the problem is more complex and can involve changing the data placement and data layout, using a different scheduler or simply changing the number of mapper and reducer slots for a job. Afterwards, you can verify whether a new execution of the same workload has a better data locality ratio.
Job code inefficiencies and “profiling” Hadoop workloads
The next item confirmed our own views and is very interesting: many Hadoop workloads suffer from inefficiencies. It is important to note that this is not a critique on Hadoop but on the jobs that are run on it. However “profiling” jobs in larger Hadoop clusters is a major pain point. Black box monitoring is not enough and traditional profilers cannot deal with the distributed nature of a Hadoop cluster. Our solution to this problem was well received by a lot of experienced Hadoop developers. We also received a lot of interesting feedback on how to make our Hadoop job “profiling” even better.
TaskTracker performance and the impact on shuffle time
It is well known that shuffle is one of the main performance critical areas in any Hadoop job. Optimizing the amount of map intermediate data (e.g. with combiners), shuffle distribution (with partitioners) and pure read/merge performance (number of threads, memory on the reducing side) are described in many Performance Tuning articles about Hadoop. Something that is less often talked about but is widely discussed by the long-term “Hadoopers” is the problem of a slowdown of particular TaskTrackers.
When particular compute nodes are under high pressure, have degrading hardware, or run into cascading effects, the local TaskTracker can be negatively impacted. To put it in more simple terms, in larger systems some nodes will degrade in performance!
The result is that the TaskTracker nodes cannot deliver the shuffle data to the reducers as fast as they should or may react with errors while doing so. This has a negative impact on virtually all reducers and because shuffle is a choke point the entire job time can and will increase. While small clusters allow us to monitor the performance of the handful of running TaskTrackers, real world clusters make that infeasible. Monitoring with Ganglia based on averages effectively hides which jobs trigger this, which are impacted and which TaskTrackers are responsible and why.
The solution to this is a baselining approach, coupled with a PurePath/PureStack model. Baselining of TaskTracker requests solves the averaging and monitoring problem and will tell us immediately if we experience a degradation of TaskTracker mapOutput performance. By always knowing which TaskTrackers slow down, we can correlate the underlying JVM host health and we are able to identify if that slowdown is due to infrastructure or Hadoop configuration issues or tied to a specific operating system version that we recently introduced. Finally, by tracing all jobs, task attempts as well as all mapOutput requests from their respective task attempts and jobs we know which jobs may trigger a TaskTracker slowdown and which jobs suffer from it.
NameNode and DataNode slowdowns
Similar to the TaskTrackers and their effect on job performance, a slowdown of the NameNode or slowdowns of particular DataNodes have a deteriorating effect on the whole cluster. Requests can easily be baselined, making the monitoring and degradation detection automatic. Similarly, we can see which jobs and clients are impacted by the slowdown and the reason for the slowdown, be it infrastructure issues, high utilization or errors in the services.
Top Cassandra Issues
One of the best presentations about Cassandra performance was done by Spotify at the Cassandra Summit. If you use Cassandra or plan to use it you; I highly recommended to watch it!
Read Time degradation over time
As it turns out Cassandra is always fast when first deployed but there are many cases where read time degrades over time. Virtually all of the use cases center around the fact that over time, the rows get spread out over many SStables and/or deletes, which lead to tombstones. All of these cases can be attributed to wrong access patterns and wrong schema design and are often data specific. For example if you write new data to the same row over a long period of time (several months) then this row will be spread out over many SStables. Access to it will become slow while access to a “younger” row (which will reside in only one SSTable) will still be snappy. Even worse is a delete/insert pattern; adding and removing columns to the same row over time. Not only will the row be spread out, it will be full of tombstones and read performance will be quite horrible. The result is that the average performance might degrade only slightly over time (averaging effect). When in reality the performance of the older rows will degrade dramatically, while the younger rows stay fast.
To avoid this, never ever delete data as general pattern in your application and never write to the same row over long periods of time. To catch such a scenario you should baseline Cassandra read requests on a per column family basis. Base-lining approaches as compared to averages will detect a change in distribution and will notify you if a percentage of your requests degrade while the majority or some stay super fast. In addition by tying the Cassandra requests to the actual types of end-user requests you will be able to quickly figure out where that access anti-pattern originates.
Some slow Nodes can bring down the cluster
Like every real world application, Cassandra Nodes can slow down due to many issues (hardware, compaction, GC, network, disk etc.). Cassandra is a clustered database where every row exists multiple times in the cluster and every write request is sent to all nodes that contain the row (even on consistency level one). It is no big deal if a single node fails because others have the same data; all read and write requests can be fulfilled. In theory a super slow node should not be a problem unless we explicitly request data with consistency level “ALL,” because Cassandra would return when the required amount of nodes responded. However internally every node has a coordinator queue that will wait for all requests to finish, even if it would respond to the client before that has happened. That queue can fill up due to one super slow node and would effectively render a single node unable to respond to any requests. This can quickly lead to a complete cluster not responding to any requests.
The solution to this is twofold. If you can, use a token-aware client like Astyanax. By talking directly to the nodes containing the data item, this client effectively bypasses the coordinator problem. In addition you should baseline the response time of Cassandra requests on the server nodes and alert yourself if a node slows down. Funnily enough bringing down the node would solve the problem temporarily because Cassandra will deal with that issue nearly instantaneously.
Too many read round trips/Too much data read
Another typical performance problem with Cassandra reminds us of the SQL days and is typical for Cassandra beginners. It is a database design issue and leads to transactions that make too many requests per end-user transaction or read too much data. This is not a problem for Cassandra itself, but the simple fact of making many requests or reading more data slows down the actual transaction. While this issue can be easily monitored and discovered with an APM solution, the fix is not as trivial as in most cases it requires a change of code and the data model.
Hadoop and Cassandra are both very scalable systems! But as often stated scalability does not solve performance efficiency issues and as such neither of these systems is immune to such problems, nor to simple misuse.
Some of the prevalent performance problems are very specific to these systems and we have not seen them in more traditional systems. Other issues are not really new, except for fact that they now occur in systems that are tremendously more distributed and bigger than before. The very scalability and size makes these problems harder to diagnose (especially in Hadoop) while often having a very high impact (as in bringing down a Cassandra cluster). Performance experts can rejoice, they will have a job for a long time to come.