Wednesday, 18 June 2014

Pig Function Cheat Sheet

Pig Function Cheat Sheet

  • Basic Operators
  • Relational Operators
  • Functions
  • Load/Store Functions
  • Math Functions
  • String Functions
  • Tuple, Bag, Map Functions
  • UDFs
  • Data Types
  • Complex Types

Basic Operators

Arithmetic Operators+, -, *, /, %, ?:X = FOREACH A GENERATE f1, f2, f1%f2;
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
Boolean Operatorsand, or, notX = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
Cast OperatorsCasting from one datatype to anotherB = FOREACH A GENERATE (int)$0 + 1;
B = FOREACH A GENERATE $0 + 1, $1 + 1.0
Comparison Operators==, !=, >, <, >=, <=, matchesX = FILTER A BY (f1 == 8);
X = FILTER A BY (f2 == ‘apache’);
X = FILTER A BY (f1 matches ‘.*apache.*’);
Construction OperatorsUsed to construct tuple (), bag {} and map []B = foreach A generate (name, age);
B = foreach A generate {(name, age)}, {name, age};
B = foreach A generate [name, gpa];
Dereference Operatorsdereference tuples ( or tuple.(id,…)), bags ( or bag.(id,…)) and maps (map#’key’)X = FOREACH A GENERATE f2.t1,f2.t3 (dereferencing is used to retrieve two fields from tuple f2)
Disambiguate Operator( :: ) used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operatorsA = load ‘data1′ as (x, y);
B = load ‘data2′ as (x, y, z);
C = join A by x, B by x;
D = foreach C generate A::y;
Flatten OperatorFlatten un-nests tuples as well as bagsconsider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c).
Null Operatoris null, is not nullX = FILTER A BY f1 is not null;
Sign Operators+ -> has no effect, – -> changes the sign of a positive/negative numberA = LOAD ‘data’ as (x, y, z);

Relational Operators

COGROUP/GROUPGroups the data in one or more relations. The COGROUP operator groups together tuples that have the same group key (key field)A = load ‘student’ AS (name:chararray,age:int,gpa:float);
B = GROUP A BY age;
CROSSComputes the cross product of two or more relationsX = CROSS A,B A = (1, 2, 3) B = (2, 4)
DUMP X; (4, 2, 1) (8, 9)
(1,2,3,2,4) (1, 3)
DEFINEAssigns an alias to a UDF or streaming command.DEFINE CMD `perl – nameMap` input(stdin using PigStreaming(‘,’)) output(stdout using PigStreaming(‘,’));
A = LOAD ‘file’;
DISTINCTRemoves duplicate tuples in a relation.X = DISTINCT A; A = (8,3,4)
DUMP X; (1,2,3)
(1,2,3) (4,3,3)
(4,3,3) (4,3,3)
(8,3,4) (1,2,3)
FILTERSelects tuples from a relation based on some condition.X = FILTER A BY f3 == 3; A = (1,2,3)
DUMP X; (4,5,6)
(1,2,3) (7,8,9)
(4,3,3) (4,3,3)
(8,4,3) (8,4,3)
FOREACHGenerates transformation of data for each row as specifiedX = FOREACH A GENERATE a1, a2; A = (1,2,3)
DUMP X; (4,2,5)
(1,2) (8,3,6)
IMPORTImport macros defined in a separate file./* myscript.pig */
IMPORT ‘my_macro.pig’;
JOINPerforms an inner join of two or more relations based on common field values.X = JOIN A BY a1, B BY b1;
(1,2,1,3) A = (1,2) B = (1,3)
(1,2,1,2) (4,5) (1,2)
(4,5,4,7) (4,7)
LOADLoads data from the file system.A = LOAD ‘myfile.txt’;
LOAD ‘myfile.txt’ AS (f1:int, f2:int, f3:int);
MAPREDUCEExecutes native MapReduce jobs inside a Pig script.A = LOAD ‘WordcountInput.txt’;
B = MAPREDUCE ‘wordcount.jar’ STORE A INTO ‘inputDir’ LOAD ‘outputDir’
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
ORDERBYSorts a relation based on one or more fields.A = LOAD ‘mydata’ AS (x: int, y: map[]);
SAMPLEPartitions a relation into two or more relations, selects a random data sample with the stated sample size.Relation X will contain 1% of the data in relation A.
A = LOAD ‘data’ AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;
SPLITPartitions a relation into two or more relations based on some expression.SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);
STOREStores or saves results to the file system.STORE A INTO ‘myoutput’ USING PigStorage (‘*’);
STREAMSends data to an external script or programA = LOAD ‘data’;
UNIONComputes the union of two or more relations. (Does not preserve the order of tuples)X = UNION A, B; A = (1,2,3) B = (2,4)
DUMP X; (4,2,1) (8,9)
(1,2,3) (1,3)


AVGAVG(expressionComputes the average of the numeric values in a single-column bag.
CONCATCONCAT (expression, expression)Concatenates two expressions of identical type.
COUNTCOUNT(expression)Computes the number of elements in a bag, it ignores null.
COUNT_STARCOUNT_STAR(expression)Computes the number of elements in a bag, it includes null.
DIFFDIFF (expression, expression)Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag.
DIFFDIFF (expression, expression)Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag.
IsEmptyIsEmpty(expression)Checks if a bag or map is empty.
MAXMAX(expression)Computes the maximum of the numeric values or chararrays in a single-column bag
MINMIN(expression)Computes the minimum of the numeric values or chararrays in a single-column bag.
SIZESIZE(expression)Computes the number of elements based on any Pig data type. SIZE includes NULL values in the size computation
SUMSUM(expression)Computes the sum of the numeric values in a single-column bag.
TOKENIZETOKENIZE(expression [, 'field_delimiter'])Splits a string and outputs a bag of words.

Load/Store Functions

Handling CompressionA = load ‘myinput.gz’;
store A into ‘myoutput.gz’;
PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
BinStorageA = LOAD ‘data’ USING BinStorage();Loads and stores data in machine-readable format.
JsonLoader, JsonStorageA = load ‘a.json’ using JsonLoader();Load or store JSON data.
PigDumpSTORE X INTO ‘output’ USING PigDump();Stores data in UTF-8 format.
PigStorageA = LOAD ‘student’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float);Loads and stores data as structured text files.
TextLoaderA = LOAD ‘data’ USING TextLoader();Loads unstructured data in UTF-8 format.

Math Functions

ABSABS(expression)Returns the absolute value of an expression. If the result is not negative (x ≥ 0), the result is returned. If the result is negative (x < 0), the negation of the result is returned.
ACOSACOS(expression)Returns the arc cosine of an expression.
ASINASIN(expression)Returns the arc sine of an expression.
ATANATAN(expression)Returns the arc tangent of an expression.
CBRTCBRT(expression)Returns the cube root of an expression.
CEILCEIL(expression)Returns the value of an expression rounded up to the nearest integer. This function never decreases the result value.
COSCOS(expression)Returns the trigonometric cosine of an expression.
COSHCOSH(expression)Returns the hyperbolic cosine of an expression.
EXPEXP(expression)Returns Euler’s number e raised to the power of x.
FLOORFLOOR(expression)Returns the value of an expression rounded down to the nearest integer. This function never increases the result value.
LOGLOG(expression)Returns the natural logarithm (base e) of an expression.
LOG10LOG10(expression)Returns the base 10 logarithm of an expression.
RANDOMRANDOM( )Returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.
ROUNDROUND(expression)Returns the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).
SINSIN(expression)Returns the sine of an expression.
SINHSINH(expression)Returns the hyperbolic sine of an expression.
SQRTSQRT(expression)Returns the positive square root of an expression.
TANTAN(expression)Returns the trignometric tangent of an angle.
TANHTANH(expression)Returns the hyperbolic tangent of an expression.

String Functions

INDEXOFINDEXOF(string, ‘character’, startIndex)Returns the index of the first occurrence of a character in a string, searching forward from a start index.
LAST_INDEXLAST_INDEX_OF(expression)Returns the index of the last occurrence of a character in a string, searching backward from a start index.
LCFIRSTLCFIRST(expression)Converts the first character in a string to lower case.
LOWERLOWER(expression)Converts all characters in a string to lower case.
REGEX_EXTRACTREGEX_EXTRACT (string, regex, index)Performs regular expression matching and extracts the matched group defined by an index parameter. The function uses Java regular expression form.
REGEX_EXTRACT_ALLREGEX_EXTRACT (string, regex)Performs regular expression matching and extracts all matched groups. The function uses Java regular expression form.
REPLACEREPLACE(string, ‘oldChar’, ‘newChar’);Replaces existing characters in a string with new characters.
STRSPLITSTRSPLIT(string, regex, limit)Splits a string around matches of a given regular expression.
SUBSTRINGSUBSTRING(string, startIndex, stopIndex)Returns a substring from a given string.
TRIMTRIM(expression)Returns a copy of a string with leading and trailing white space removed.
UCFIRSTUCFIRST(expression)Returns a string with the first character converted to upper case.
UPPERUPPER(expression)Returns a string converted to upper case.

Tuple, Bag, Map Functions

TOTUPLETOTUPLE(expression [, expression ...])Converts one or more expressions to type tuple.
TOBAGTOBAG(expression [, expression ...])Converts one or more expressions to individual tuples which are then placed in a bag.
TOMAPTOMAP(key-expression, value-expression [, key-expression, value-expression ...])Converts key/value expression pairs into a map. Needs an even number of expressions as parameters. The elements must comply with map type rules.
TOPTOP(topN,column,relation)Returns the top-n tuples from a bag of tuples.

User Defined Functions (UDFs)

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript and Ruby.
Registering UDFs
Registering Java UDFs:
register 'your_path_to_piggybank/piggybank.jar';
divs      = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                date:chararray, dividends:float);
Registering Python UDFs (The Python script must be in your current directory):
register '' using jython as bballudfs;
players  = load 'baseball' as (name:chararray, team:chararray,
                pos:bag{t:(p:chararray)}, bat:map[]);
Writing UDFs
Java UDFs:
package myudfs;
import org.apache.pig.EvalFunc;

public class UPPER extends EvalFunc
   public String exec(Tuple input) throws IOException {
       if (input == null || input.size() == 0)
           return null;
              String str = (String)input.get(0);
              return str.toUpperCase();
           }catch(Exception e){
              throw new IOException("Caught exception processing input row ", e);
Python UDFs
#Square - Square of a number of any data type
@outputSchemaFunction("squareSchema") -- Defines a script delegate function that defines schema for this function depending upon the input type.
def square(num):
   return ((num)*(num))
@schemaFunction("squareSchema") --Defines delegate function and is not registered to Pig.
 def squareSchema(input):
   return input

 #Percent- Percentage
 @outputSchema("percent:double") --Defines schema for a script UDF in a format that Pig understands and is able to parse
 def percent(num, total):
   return num * 100 / total

Data Types

Simple Types

intSigned 32-bit integer10
longSigned 64-bit integerData: 10L or 10l
Display: 10L
float32-bit floating pointData: 10.5F or 10.5f or 10.5e2f or 10.5E2F
Display: 10.5F or 1050.0F
double64-bit floating pointData: 10.5 or 10.5e2 or 10.5E2
Display: 10.5 or 1050.0
chararrayCharacter array (string) in Unicode UTF-8 formathello world
bytearrayByte array (blob)
booleanbooleantrue/false (case insensitive)

Complex Types

tupleAn ordered set of fields.(19,2)
bagAn collection of tuples.{(19,2), (18,1)}
mapA set of key value pairs.[name#John,phone#5551212]
Related Posts Plugin for WordPress, Blogger...