Tuesday, 21 October 2014

Snappy compression with Pig and native MapReduce

This post assumes you have installed Hadoop with Snappy support on your cluster; if not, please follow http://code.google.com/p/hadoop-snappy/
This is the machine configuration of my cluster nodes, though the steps below should also apply to other installations and machine configurations:
pkommireddi@pkommireddi-wsl:/tools/hadoop/pig-0.9.1/lib$ uname -a
Linux pkommireddi-wsl 2.6.32-37-generic #81-Ubuntu SMP Fri Dec 2 20:32:42 UTC 2011 x86_64 GNU/Linux
Pig requires that the Snappy jar and the Snappy native library be available on its classpath when a script is run.
The Pig client here is installed at /tools/hadoop, and the jar needs to be placed within $PIG_HOME/lib.
You also need to point Pig to the Snappy native library:
export PIG_OPTS="$PIG_OPTS -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"
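Before running a script, it can help to confirm that the directory passed in -Djava.library.path really contains the Snappy native library. The following is a minimal sketch and not part of Pig or Hadoop: the class name NativeLibCheck and the helper dirsContaining are made up for illustration, and the file name it looks for (whatever System.mapLibraryName("snappy") returns on your platform) is an assumption about how the library is named on your nodes.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig or Hadoop.
class NativeLibCheck {

    // Returns the directories of a java.library.path-style string
    // that contain a regular file with the given name.
    static List<String> dirsContaining(String libraryPath, String fileName) {
        List<String> hits = new ArrayList<>();
        if (libraryPath == null || libraryPath.isEmpty()) {
            return hits;
        }
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path");
        // Assumption: the native library uses the platform's default naming
        // for a library called "snappy".
        String fileName = System.mapLibraryName("snappy");
        List<String> hits = dirsContaining(libraryPath, fileName);
        if (hits.isEmpty()) {
            System.out.println(fileName + " not found on java.library.path: " + libraryPath);
        } else {
            System.out.println(fileName + " found in: " + hits);
        }
    }
}
```

Run it with the same library path that PIG_OPTS sets, for example: java -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64 NativeLibCheck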
Now you have two ways to use map output compression in your Pig scripts:
  1. Follow the instructions at http://code.google.com/p/hadoop-snappy/ to set map output compression at the cluster level
  2. Use Pig’s “set” keyword for per-job configuration
    set mapred.compress.map.output true;
    set mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
This should get you going with Snappy for map output compression in Pig. You can read and write Snappy-compressed files as well, though I would not recommend it, as Snappy is not very space-efficient compared to other compression algorithms. Work is under way to allow Snappy to be used for the intermediate/temporary files created between multiple MR jobs. You can watch the work item here: https://issues.apache.org/jira/browse/PIG-2319
Using Snappy for Native Java MapReduce:
Set Configuration parameters for Map output compression
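Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
// Without an explicit codec, map output is compressed with Hadoop's default codec rather than Snappy
conf.setClass("mapred.map.output.compression.codec", SnappyCodec.class, CompressionCodec.class);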
Set Configuration parameters for Snappy compressed intermediate Sequence Files
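JobConf jobConf = new JobConf(conf); // these setters take a JobConf (old "mapred" API), not a plain Configuration
SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.BLOCK); // Block level is better than Record level, in most cases
SequenceFileOutputFormat.setCompressOutput(jobConf, true);
SequenceFileOutputFormat.setOutputCompressorClass(jobConf, SnappyCodec.class); // otherwise the default codec is used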
Benefits of Snappy
  1. Map tasks begin transferring data sooner than with Gzip or Bzip2 (though more data needs to be transferred to the Reduce tasks)
  2. Reduce tasks run faster thanks to better decompression speeds
  3. Snappy is not CPU intensive, which means MR tasks have more CPU available for user operations
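The trade-off behind these points (less CPU time spent compressing, at the cost of larger output) can be observed with any codec that offers a fast and a strong setting. Snappy is not part of the JDK, so the sketch below uses java.util.zip.Deflater at BEST_SPEED and BEST_COMPRESSION purely as a stand-in; it is not a Snappy benchmark. The class name CodecTradeoff and the generated sample data are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Stand-in illustration only: Deflater levels, not Snappy.
class CodecTradeoff {

    // Compresses the input at the given Deflater level and returns the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Builds made-up, tab-separated records resembling map output.
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("user").append(i % 97)
              .append("\tclick\t").append(i * 31 % 1009).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(200_000);
        System.out.println("input size=" + data.length + " bytes");
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("level=" + level + " size=" + size + " bytes time=" + millis + " ms");
        }
    }
}
```

Timings from a single run like this are noisy, so treat the numbers as a rough indication rather than a measurement.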
What you SHOULD use Snappy for
Map output: Snappy works great if you have large amounts of data flowing from the Mappers to the Reducers (you might not see a significant difference if the data volume between Map and Reduce is low).
Temporary intermediate files (not available as of Pig 0.9.2; currently applicable only to native MapReduce): If you have a series of MR jobs chained together, Snappy compression is a good way to store the intermediate files. Please do make sure these intermediate files are cleaned up soon enough that they do not cause disk space issues on the cluster.
What you should NOT use Snappy for
Permanent storage: Snappy compression is not space-efficient, and storing data on HDFS is already expensive because of 3-way replication.
Plain text files: Like Gzip, Snappy is not splittable. Do not store plain text files in Snappy-compressed form; use a container like SequenceFile instead.
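To see what "not splittable" means in practice, here is a minimal sketch using the JDK's Gzip classes (the codec the point above compares Snappy to, since Snappy itself is not in the JDK). It compresses some made-up text and then tries to decode it starting from the middle of the compressed bytes, which is roughly what a second map task would have to do with its split of one large compressed file. The class name SplitDemo and its helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration with Gzip only; no Hadoop or Snappy classes involved.
class SplitDemo {

    // Gzip-compresses the given bytes in memory.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Returns true if a Gzip reader can decode the stream starting at the given byte offset.
    static boolean canDecodeFrom(byte[] compressed, int offset) {
        byte[] slice = Arrays.copyOfRange(compressed, offset, compressed.length);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(slice))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // drain the stream; only success or failure matters here
            }
            return true;
        } catch (IOException e) {
            // A reader that starts mid-stream finds no valid header to begin from.
            return false;
        }
    }

    // Builds made-up plain-text lines.
    static byte[] sampleText(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("line ").append(i).append(" of a plain text log file\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = gzip(sampleText(10_000));
        System.out.println("decode from start:  " + canDecodeFrom(compressed, 0));
        System.out.println("decode from middle: " + canDecodeFrom(compressed, compressed.length / 2));
    }
}
```

Decoding should succeed only from the start of the stream, so a single task ends up reading the whole file. A container like SequenceFile avoids this by compressing records or blocks individually and placing sync markers between them.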