Hadoop provides a set of options for CPU, memory, disk, and network
performance tuning. Most Hadoop tasks are not CPU bound; what we
usually look at is how to optimize memory usage and disk spills.
Memory tuning
The general rule for memory tuning is: use as much memory as you can,
but don’t trigger swapping. The parameter you can set for task memory
is mapred.child.java.opts, and you can put it in your configuration file.
You can find the best values by monitoring memory usage on the servers
with Ganglia, Cloudera Manager, or Nagios. Cloudera also has a slide deck
focused on memory usage tuning.
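For example, an entry for this property in the configuration file might look like the sketch below; the 1 GB heap is only an illustrative value and should be chosen from the monitoring described above:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value> <!-- illustrative heap size; keep total memory use below the swapping point -->
</property>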
Minimize the map disk spill
Disk IO is usually the performance bottleneck. There are a lot of
parameters you can tune to minimize spilling. The ones I use the most
control the map-side sort buffer and how often it spills to disk.
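A hedged example of what these entries might look like in the configuration file, assuming the MRv1 property names io.sort.mb and io.sort.spill.percent; the values are illustrative, not recommendations:

<property>
  <name>io.sort.mb</name>
  <value>200</value> <!-- in-memory buffer for sorting map output, in MB (default 100) -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.90</value> <!-- how full the buffer gets before spilling to disk (default 0.80) -->
</property>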
Although you can further tune the reducer buffer, the mapper sort record
percent, and various other settings, I found the best thing to do is to
reduce the mapper output size. Most of the time, the performance is fast
enough after I refactor the mapper to output as little data as possible.
For more information, check the same Cloudera performance tuning guide.
Tuning mapper tasks
Unlike reducer tasks, for which you can specify the number of reducers, the
number of mapper tasks is set implicitly. The tuning goal for the
mapper is to control the number of mappers and the size of each job. When
dealing with large files, Hadoop splits the file into smaller chunks so
that mappers can run on them in parallel. However, initializing a new mapper
task usually takes a few seconds, and this is also an overhead that we want to
minimize. These are the things you can do:
Reuse the task JVM.
If the average mapper running time is shorter than one minute, you can increase mapred.min.split.size so that fewer mappers are allocated and the mapper initialization overhead is reduced (see the example configuration below).
Use CombineFileInputFormat for a bunch of smaller files. I had an implementation that also used mapred.min.split.size to implicitly control the mapper size.
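As a rough sketch, the configuration-file entries for JVM reuse and a larger minimum split size might look like this; both values are illustrative and depend on how long your mappers actually run:

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value> <!-- -1 means the JVM is reused for any number of tasks in a job -->
</property>
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value> <!-- minimum split size in bytes (here 256 MB), so fewer mappers are launched -->
</property>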
Use a configuration file and command-line arguments to set parameters
When I first started with Hadoop, I set these parameters in the Java
program, but that is hard-coded and inflexible. Thankfully, Hadoop
provides the Tool interface and the ToolRunner class to parse those parameters for you. Here’s a sample program:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ExampleJob(), args));
    }

    public int run(String[] args) throws Exception {
        // getConf() already holds any -conf file and -D options parsed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf);
        // configure the rest of the job
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
If your main class implements the interface, your program can take the config file as input:
hadoop jar ExampleJob-0.0.1.jar ExampleJob -conf my-conf.xml arg0 arg1
You can even pass extra parameters through the command line, like this:
hadoop jar ExampleJob-0.0.1.jar ExampleJob -Dmapred.reduce.tasks=20 arg0 arg1
Setting configuration as run-time arguments makes it easier to test different parameters without recompiling the program.
Tuning application-specific performance
Beyond the general Hadoop parameter setup, you can optimize your
map-reduce program using some small tricks. Here are the tricks I
use the most.
Minimize your mapper output
Recall that the mapper spill size is a serious performance bottleneck.
The size of the mapper output affects disk IO, network IO, and
memory usage during the shuffle phase. Minimizing the mapper output can
improve the overall performance a lot.
To do this, you can try the following:
Filter out records on the mapper side, not on the reducer side.
Use minimal data to form your map output key and map output value.
Extend BinaryComparable or use Text for your map output key.
Set the mapper output to be compressed (a sketch follows this list).
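For the last item, here is a hedged sketch of enabling map output compression from the driver; it assumes the Snappy codec and its native libraries are available on the cluster:

// Inside run() of a Tool driver such as ExampleJob above:
// compress intermediate map output to cut disk and network IO during the shuffle.
Configuration conf = getConf();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);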
Of all the optimization tips, I found this one makes the biggest difference
in many of my tasks, unless I can’t find a smaller key to reduce the
mapper output.
Balancing reducer load
Another common performance issue that you might encounter is
unbalanced reducer tasks: one or a few reducers take most of the
output from the mappers and run extremely long compared to the other reducers.
To solve this, you can either:
Implement a better hash function in your Partitioner class (a sketch follows this list).
If you know which keys are causing the issue, write a preprocessing
job to separate those keys using MultipleOutputs, then use another
map-reduce job to process the special keys that cause the problem.
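As a rough illustration of the first option, a custom partitioner might mix the hash bits before taking the modulo so that a dominant group of keys does not all land on one reducer; the class name and key/value types below are hypothetical, not from the original job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that spreads skewed Text keys more evenly
// than the default hash partitioning.
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hash = key.toString().hashCode();
        hash = hash ^ (hash >>> 16);          // mix high and low bits
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}

Register it in the driver with job.setPartitionerClass(BalancedPartitioner.class).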
Conclusion
It’s fun to write raw map-reduce jobs because it gives you more
precise control over performance tuning. If you already have experience with
Hive or Pig, I encourage you to try optimizing the same job using raw
map-reduce. You will find a lot of performance gains and more room to
tune. If you are curious, you can also check Yahoo’s Hadoop performance tuning guides.