Monday, 20 October 2014

XML parsing using PIG

These are the steps for parsing your XML files with Pig.

---------------------------------------------------------------------------------------------------------------

Step 1: Add the Pig bin directory to your PATH
export PATH=/home/hadoop/work/pig-0.11.1/bin:$PATH

Step 2: Register the jar file

REGISTER '/home/hadoop/work/pig-0.11.1/contrib/piggybank/java/piggybank.jar'

Step 3: Load the data

xml = load '/home/hadoop/work/hadoop-1.1.2/conf/mapred-site.xml' USING 
org.apache.pig.piggybank.storage.XMLLoader('name') as(doc:chararray);
The data looks like this. XMLLoader('name') returns each <name>...</name> element as one record in doc:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

Step 4: Parse the file and retrieve the value

value = foreach xml GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<name>(.*)</name>'))  AS name:chararray;

Step 5: Show the value

dump value;
The output will be:
(fs.default.name)
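
If you want to check the Step 4 pattern before running a job, a quick sketch outside Pig can help. The Python snippet below is not part of the Pig script, and the function name extract_name is made up for illustration. It assumes that REGEX_EXTRACT_ALL returns groups only when the pattern matches the whole string, which is what Python's re.fullmatch does.

```python
import re

# Same pattern as in Step 4 (it contains no backslashes, so no extra escaping is needed).
NAME_PATTERN = re.compile(r"<name>(.*)</name>")


def extract_name(doc):
    """Return the captured groups as a tuple, or None if the whole string does not match."""
    match = NAME_PATTERN.fullmatch(doc)
    return match.groups() if match else None


if __name__ == "__main__":
    # One record as produced by loading with the 'name' tag.
    print(extract_name("<name>fs.default.name</name>"))
    # A record with anything around the <name> element does not match as a whole.
    print(extract_name("<property><name>x</name></property>"))
```

The second call returns None, which is why the loader tag ('name') and the pattern have to describe the same element.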

Parsing a file with multiple elements
The data looks like:
<property>
 <fname>kalyan</fname>
 <lname>hadoop</lname>
 <landmark>annapurna block</landmark>
 <city>hyderabad</city>
 <state>Telengana</state>
 <contact>1234567890</contact>
 <email>kalyan@gmail.com</email>
 <PAN_Card>0011542</PAN_Card>
 <URL>kalyanhadooptraining.blogspot.com</URL>
</property>

Load the data:
pigdata = load '/home/hadoop/work/input/file.txt' USING 
org.apache.pig.piggybank.storage.XMLLoader('property') as (doc:chararray);

Parse the values:
-- contact and PAN_Card are declared as chararray so that leading zeros (for example 0011542) are kept
values = foreach pigdata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<landmark>(.*)</landmark>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<contact>(.*)</contact>\\s*<email>(.*)</email>\\s*<PAN_Card>(.*)</PAN_Card>\\s*<URL>(.*)</URL>\\s*</property>')) AS (fname:chararray, lname:chararray, landmark:chararray, city:chararray, state:chararray, contact:chararray, email:chararray, PAN_Card:chararray, URL:chararray);


dump values;
Output will be:
(kalyan,hadoop,annapurna block,hyderabad,Telengana,1234567890,kalyan@gmail.com,0011542,kalyanhadooptraining.blogspot.com)
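
The long pattern above is easy to mistype, so here is a small sketch that builds it from a list of element names and tries it on the sample record. This is Python, not Pig, and the helper names (build_pattern, parse_record, to_pig_literal) are hypothetical. It again assumes whole-string matching, and it assumes the elements always appear in the listed order.

```python
import re

# Element names in the order they appear inside <property> (taken from the sample data).
FIELDS = ["fname", "lname", "landmark", "city", "state",
          "contact", "email", "PAN_Card", "URL"]

SAMPLE = """<property>
 <fname>kalyan</fname>
 <lname>hadoop</lname>
 <landmark>annapurna block</landmark>
 <city>hyderabad</city>
 <state>Telengana</state>
 <contact>1234567890</contact>
 <email>kalyan@gmail.com</email>
 <PAN_Card>0011542</PAN_Card>
 <URL>kalyanhadooptraining.blogspot.com</URL>
</property>"""


def build_pattern(root, fields):
    """Build a regex with one capture group per field, separated by optional whitespace."""
    parts = ["<%s>(.*)</%s>" % (f, f) for f in fields]
    return "<%s>\\s*%s\\s*</%s>" % (root, "\\s*".join(parts), root)


def parse_record(doc, pattern):
    """Return all captured values as a tuple of strings, or None if the record does not match."""
    match = re.fullmatch(pattern, doc)
    return match.groups() if match else None


def to_pig_literal(pattern):
    """Double every backslash so the pattern can be pasted between single quotes in Pig."""
    return pattern.replace("\\", "\\\\")


if __name__ == "__main__":
    pattern = build_pattern("property", FIELDS)
    print(parse_record(SAMPLE, pattern))
    print(to_pig_literal(pattern))
```

Every captured value comes back as a string. Converting an identifier such as 0011542 to a number gives 11542, so fields like that are usually better kept as text.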