Tuesday, 21 October 2014

Code, Debug & Test Apache Pig Scripts using Eclipse on Windows

When developing software, knowing how to debug your code is one of the most important skills. It makes bugs easier to fix and helps you understand the internals of the frameworks your code depends on. This applies to Apache Pig scripting as well. In this post I will explain how to code, debug and test Apache Pig scripts using Eclipse on Windows.
Prerequisites:
  1. Install Eclipse Juno or above
  2. Install the m2eclipse plugin
  3. Install JDK 1.6 or above
  4. Install Cygwin 1.7.5 or above
  5. Add the <CYGWIN_HOME>/bin folder to the PATH environment variable, where CYGWIN_HOME is the Cygwin installation directory.
Before you start with the following steps, make sure all prerequisites are met.
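As a quick way to confirm the last prerequisite, here is a small sketch that checks whether any PATH entry mentions Cygwin. The class name CygwinPathCheck and the assumption that the install directory contains the word "cygwin" are mine, so adjust the fragment if your CYGWIN_HOME is named differently.

```java
import java.io.File;

// Hypothetical helper: checks whether the PATH variable mentions a Cygwin directory.
class CygwinPathCheck {

    // Returns true if any PATH entry contains the given fragment (case-insensitive).
    static boolean pathContains(String path, String fragment) {
        if (path == null) {
            return false;
        }
        for (String entry : path.split(File.pathSeparator)) {
            if (entry.toLowerCase().contains(fragment.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the Cygwin install directory has "cygwin" somewhere in its name.
        boolean found = pathContains(System.getenv("PATH"), "cygwin");
        System.out.println(found ? "Cygwin found on PATH" : "Cygwin not found on PATH");
    }
}
```

If it reports that Cygwin is missing, restart Eclipse after fixing PATH so that it picks up the new environment.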
1. Start Eclipse
2. From the Eclipse File menu, create a new project.
3. From the New Project wizard, select Maven Project and click on the Next button.
4. On the New Maven Project screen, click on the Next button.
5. Select “maven-archetype-quickstart” as the project archetype and click on the Next button.
6. Enter appropriate Group Id, Artifact Id, Version & Package name and click on Finish button.
7. Eclipse creates a Maven project and shows it in the Package Explorer. I have expanded the project to show its structure. The project consists of an auto-generated App.java file and its corresponding AppTest.java file, which contains the JUnit test code. There is also a file named pom.xml, which is the Maven project descriptor.
8. Double click on pom.xml to open it in POM editor.
9. Click on the pom.xml tab in the POM editor to see the contents of pom.xml
10. Add the Cloudera repository information to pom.xml. This project needs some important dependencies (e.g. pig, pigunit and hadoop), which are available as Maven artifacts in Cloudera's Maven repository.
11. Add dependencies on pig, pigunit and hadoop, plus some supporting dependencies such as antlr and jackson. We need pig and pigunit to debug Pig scripts in Eclipse, and because Pig requires hadoop-core to work, we also need a dependency on hadoop-core.
12. While debugging we also need the source code and Javadoc of the dependencies. To enable downloading them, go to the menu Window > Preferences > Maven.
13. Select the checkboxes for Download Artifact Sources and Download Artifact JavaDoc. Click on Apply and then OK. After this, the m2eclipse plugin downloads the sources and Javadoc and attaches them to the artifact jars.
14. Right click on the main folder and create a new folder under it.
15. Name the newly created folder resources.
16. In the resources folder, create two files named wordcount.pig and sample.data. In wordcount.pig, we are going to write Pig code that counts the number of occurrences of each word in the sample.data file.
17. Add data into sample.data file.
Johny, Johny!
Yes, Papa
Eating sugar?
No, Papa
Telling lies?
No, Papa
Open your mouth!
Ha! Ha! Ha!
18. Add pig code in the wordcount.pig file.
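-- Load the input; the data has no tabs, so $0 holds each whole line.
A = load 'src/main/resources/sample.data';
-- Split every line into words, one word per record.
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Group identical words together and count each group.
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;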
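To see which counts the script should produce before running it, here is a plain-Java approximation of the same logic. This is not Pig code. The class name WordCountSketch is made up for illustration, and the delimiter set is my understanding of what the single-argument TOKENIZE splits on (space, double quote, comma, parentheses and asterisk), so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java approximation of wordcount.pig; this is not Pig code.
class WordCountSketch {

    // Assumption: delimiters used by Pig's single-argument TOKENIZE.
    static final String DELIMITERS = " \",()*";

    // The same lines as sample.data.
    static List<String> sampleLines() {
        return Arrays.asList(
                "Johny, Johny!", "Yes, Papa", "Eating sugar?", "No, Papa",
                "Telling lies?", "No, Papa", "Open your mouth!", "Ha! Ha! Ha!");
    }

    // Tokenizes every line and counts each token, like aliases B, C and D.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print each entry in the "(count,word)" form used for alias D.
        countWords(sampleLines()).forEach(
                (word, n) -> System.out.println("(" + n + "," + word + ")"));
    }
}
```

Note that punctuation such as "!" and "?" is not a delimiter here, which is why "Johny" and "Johny!" are counted as two different words. The sketch prints the pairs in alphabetical order, which is not necessarily the order Pig emits them in.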
19. Now we need to add a PigUnit test case to test and debug the wordcount.pig script.
20. Double click on the AppTest.java file to open it in the Java editor.
21. In AppTest.java, remove all the existing functions from the AppTest class and add the testWordCountScript function.
public void testWordCountScript() throws Exception {
    // Requires: import org.apache.pig.pigunit.PigTest;
    PigTest pigTest = new PigTest("src/main/resources/wordcount.pig");
    // Run the script and compare alias D with the expected tuples.
    pigTest.assertOutput("D", new String[] { "(2,No)", "(3,Ha!)",
            "(1,Yes)", "(1,Open)", "(3,Papa)", "(1,your)", "(1,Johny)",
            "(1,lies?)", "(1,Eating)", "(1,Johny!)", "(1,mouth!)",
            "(1,sugar?)", "(1,Telling)", });
}
22. Because we are going to run Pig code inside Eclipse, the unit test needs a larger heap than the default.
23. To set that, select the AppTest.java file and go to the menu Run > Run Configurations …
24. In the Run Configurations window, double click on JUnit to create a run configuration for AppTest.
25. Go to the Arguments tab and, in the VM arguments section, add “-Xmx1024m” to set the maximum JVM heap to 1 GB.
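If you want to confirm that the VM argument is actually picked up, a tiny check like the one below can be run with the same run configuration. HeapCheck is a hypothetical class name, and it only uses the standard Runtime API. The reported value can be slightly lower than 1024 depending on the JVM and garbage collector.

```java
// Hypothetical helper: reports the maximum heap available to the current JVM.
class HeapCheck {

    // Converts Runtime.maxMemory() from bytes to whole megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```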
26. Select the AppTest.java file again and go to the menu Run > Debug Configurations …
27. Select AppTest and click on Debug.
28. The test case should execute successfully and show a green bar.
29. Now we want to debug the COUNT UDF. Press Ctrl+Shift+T, the Eclipse shortcut for opening a type. In the Open Type window, type COUNT in the first text field. This lists all the classes matching the name COUNT. We are interested in the COUNT class in the package org.apache.pig.builtin. Select that class and click on the OK button.
30. Because we enabled attaching sources to the jars, the source code of the COUNT UDF is shown in the Java editor.
31. Because COUNT is an aggregate UDF, it contains implementations for the Initial, Intermediate and Final stages. Put a breakpoint in the exec function of each of these three implementations.
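If the three stages are new to you, the following plain-Java sketch shows the general idea of an algebraic count: count small batches first, then combine the partial counts. These are not Pig's actual classes. The class and method names are made up for illustration, and the real COUNT implementation works on tuples and bags rather than strings and longs.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of an algebraic count; these are NOT Pig's classes.
class AlgebraicCountSketch {

    // Initial stage: turn each input record into a partial count of 1.
    static long initial(String record) {
        return 1L;
    }

    // Intermediate stage: combine partial counts into a bigger partial count.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long partial : partials) {
            sum += partial;
        }
        return sum;
    }

    // Final stage: combine the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // Two hypothetical batches holding the word "Papa" three times in total.
        long batch1 = intermediate(Arrays.asList(initial("Papa"), initial("Papa")));
        long batch2 = intermediate(Arrays.asList(initial("Papa")));
        System.out.println(finalCount(Arrays.asList(batch1, batch2))); // prints 3
    }
}
```

When you step through the real UDF, which of the three breakpoints are hit and in what order depends on how Pig plans the job, so do not be surprised if the sequence is not strictly Initial, Intermediate, Final.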
32. Select the AppTest.java file again and go to the menu Run > Debug Configurations …
33. Select AppTest and click on Debug.
34. A “Confirm Perspective Switch” dialog box will appear. Click on the Yes button.
35. You can see the active breakpoint in the Java editor. Using the Eclipse debugger, you can now step through the complete COUNT UDF.
36. Now you are ready to debug any UDF code or even pig code.
Hope it helps!
