# Pig Function Cheat Sheet

- Basic Operators
- Relational Operators
- Functions
- Load/Store Functions
- Math Functions
- String Functions
- Tuple, Bag, Map Functions
- UDFs
- Data Types
- Complex Types

# Basic Operators

Operator | Description | Example |
---|---|---|

Arithmetic Operators | +, -, *, /, %, ?: | X = FOREACH A GENERATE f1, f2, f1%f2; X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B)); |

Boolean Operators | and, or, not | X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1)); |

Cast Operators | Casting from one datatype to another | B = FOREACH A GENERATE (int)$0 + 1; B = FOREACH A GENERATE $0 + 1, $1 + 1.0 |

Comparison Operators | ==, !=, >, <, >=, <=, matches | X = FILTER A BY (f1 == 8); X = FILTER A BY (f2 == ‘apache’); X = FILTER A BY (f1 matches ‘.*apache.*’); |

Construction Operators | Used to construct tuple (), bag {} and map [] | B = foreach A generate (name, age); B = foreach A generate {(name, age)}, {name, age}; B = foreach A generate [name, gpa]; |

Dereference Operators | dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)) and maps (map#’key’) | X = FOREACH A GENERATE f2.t1,f2.t3 (dereferencing is used to retrieve two fields from tuple f2) |

Disambiguate Operator | ( :: ) used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators | A = load ‘data1′ as (x, y); B = load ‘data2′ as (x, y, z); C = join A by x, B by x; D = foreach C generate A::y; |

Flatten Operator | Flatten un-nests tuples as well as bags | consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c). |

Null Operator | is null, is not null | X = FILTER A BY f1 is not null; |

Sign Operators | + -> has no effect, – -> changes the sign of a positive/negative number | A = LOAD ‘data’ as (x, y, z); B = FOREACH A GENERATE -x, y; |

# Relational Operators

Operator | Description | Example |
---|---|---|

COGROUP/GROUP | Groups the data in one or more relations. The COGROUP operator groups together tuples that have the same group key (key field) | A = load ‘student’ AS (name:chararray,age:int,gpa:float); B = GROUP A BY age; |

CROSS | Computes the cross product of two or more relations | X = CROSS A,B A = (1, 2, 3) B = (2, 4) DUMP X; (4, 2, 1) (8, 9) (1,2,3,2,4) (1, 3) (1,2,3,8,9) (1,2,3,1,3) (4,2,1,2,4) (4,2,1,8,9) (4,2,1,1,3) |

DEFINE | Assigns an alias to a UDF or streaming command. | DEFINE CMD `perl PigStreaming.pl – nameMap` input(stdin using PigStreaming(‘,’)) output(stdout using PigStreaming(‘,’)); A = LOAD ‘file’; B = STREAM B THROUGH CMD; |

DISTINCT | Removes duplicate tuples in a relation. | X = DISTINCT A; A = (8,3,4) DUMP X; (1,2,3) (1,2,3) (4,3,3) (4,3,3) (4,3,3) (8,3,4) (1,2,3) |

FILTER | Selects tuples from a relation based on some condition. | X = FILTER A BY f3 == 3; A = (1,2,3) DUMP X; (4,5,6) (1,2,3) (7,8,9) (4,3,3) (4,3,3) (8,4,3) (8,4,3) |

FOREACH | Generates transformation of data for each row as specified | X = FOREACH A GENERATE a1, a2; A = (1,2,3) DUMP X; (4,2,5) (1,2) (8,3,6) (4,2) (8,3) |

IMPORT | Import macros defined in a separate file. | /* myscript.pig */ IMPORT ‘my_macro.pig’; |

JOIN | Performs an inner join of two or more relations based on common field values. | X = JOIN A BY a1, B BY b1; DUMP X (1,2,1,3) A = (1,2) B = (1,3) (1,2,1,2) (4,5) (1,2) (4,5,4,7) (4,7) |

LOAD | Loads data from the file system. | A = LOAD ‘myfile.txt’; LOAD ‘myfile.txt’ AS (f1:int, f2:int, f3:int); |

MAPREDUCE | Executes native MapReduce jobs inside a Pig script. | A = LOAD ‘WordcountInput.txt’; B = MAPREDUCE ‘wordcount.jar’ STORE A INTO ‘inputDir’ LOAD ‘outputDir’ AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; |

ORDERBY | Sorts a relation based on one or more fields. | A = LOAD ‘mydata’ AS (x: int, y: map[]); B = ORDER A BY x; |

SAMPLE | Partitions a relation into two or more relations, selects a random data sample with the stated sample size. | Relation X will contain 1% of the data in relation A. A = LOAD ‘data’ AS (f1:int,f2:int,f3:int); X = SAMPLE A 0.01; |

SPLIT | Partitions a relation into two or more relations based on some expression. | SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null); |

STORE | Stores or saves results to the file system. | STORE A INTO ‘myoutput’ USING PigStorage (‘*’); 1*2*3 4*2*1 |

STREAM | Sends data to an external script or program | A = LOAD ‘data’; B = STREAM A THROUGH `stream.pl -n 5`; |

UNION | Computes the union of two or more relations. (Does not preserve the order of tuples) | X = UNION A, B; A = (1,2,3) B = (2,4) DUMP X; (4,2,1) (8,9) (1,2,3) (1,3) (4,2,1) (2,4) (8,9) (1,3) |

# Functions

Function | Syntax | Description |
---|---|---|

AVG | AVG(expression | Computes the average of the numeric values in a single-column bag. |

CONCAT | CONCAT (expression, expression) | Concatenates two expressions of identical type. |

COUNT | COUNT(expression) | Computes the number of elements in a bag, it ignores null. |

COUNT_STAR | COUNT_STAR(expression) | Computes the number of elements in a bag, it includes null. |

DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |

DIFF | DIFF (expression, expression) | Compares two fields in a tuple, any tuples that are in one bag but not the other are returned in a bag. |

IsEmpty | IsEmpty(expression) | Checks if a bag or map is empty. |

MAX | MAX(expression) | Computes the maximum of the numeric values or chararrays in a single-column bag |

MIN | MIN(expression) | Computes the minimum of the numeric values or chararrays in a single-column bag. |

SIZE | SIZE(expression) | Computes the number of elements based on any Pig data type. SIZE includes NULL values in the size computation |

SUM | SUM(expression) | Computes the sum of the numeric values in a single-column bag. |

TOKENIZE | TOKENIZE(expression [, 'field_delimiter']) | Splits a string and outputs a bag of words. |

# Load/Store Functions

FUnction | Syntax | Description |
---|---|---|

Handling Compression | A = load ‘myinput.gz’; store A into ‘myoutput.gz’; | PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression. |

BinStorage | A = LOAD ‘data’ USING BinStorage(); | Loads and stores data in machine-readable format. |

JsonLoader, JsonStorage | A = load ‘a.json’ using JsonLoader(); | Load or store JSON data. |

PigDump | STORE X INTO ‘output’ USING PigDump(); | Stores data in UTF-8 format. |

PigStorage | A = LOAD ‘student’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float); | Loads and stores data as structured text files. |

TextLoader | A = LOAD ‘data’ USING TextLoader(); | Loads unstructured data in UTF-8 format. |

# Math Functions

Operator | Description | Example |
---|---|---|

ABS | ABS(expression) | Returns the absolute value of an expression. If the result is not negative (x ≥ 0), the result is returned. If the result is negative (x < 0), the negation of the result is returned. |

ACOS | ACOS(expression) | Returns the arc cosine of an expression. |

ASIN | ASIN(expression) | Returns the arc sine of an expression. |

ATAN | ATAN(expression) | Returns the arc tangent of an expression. |

CBRT | CBRT(expression) | Returns the cube root of an expression. |

CEIL | CEIL(expression) | Returns the value of an expression rounded up to the nearest integer. This function never decreases the result value. |

COS | COS(expression) | Returns the trigonometric cosine of an expression. |

COSH | COSH(expression) | Returns the hyperbolic cosine of an expression. |

EXP | EXP(expression) | Returns Euler’s number e raised to the power of x. |

FLOOR | FLOOR(expression) | Returns the value of an expression rounded down to the nearest integer. This function never increases the result value. |

LOG | LOG(expression) | Returns the natural logarithm (base e) of an expression. |

LOG10 | LOG10(expression) | Returns the base 10 logarithm of an expression. |

RANDOM | RANDOM( ) | Returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0. |

ROUND | ROUND(expression) | Returns the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double). |

SIN | SIN(expression) | Returns the sine of an expression. |

SINH | SINH(expression) | Returns the hyperbolic sine of an expression. |

SQRT | SQRT(expression) | Returns the positive square root of an expression. |

TAN | TAN(expression) | Returns the trignometric tangent of an angle. |

TANH | TANH(expression) | Returns the hyperbolic tangent of an expression. |

# String Functions

Operator | Description | Example |
---|---|---|

INDEXOF | INDEXOF(string, ‘character’, startIndex) | Returns the index of the first occurrence of a character in a string, searching forward from a start index. |

LAST_INDEX | LAST_INDEX_OF(expression) | Returns the index of the last occurrence of a character in a string, searching backward from a start index. |

LCFIRST | LCFIRST(expression) | Converts the first character in a string to lower case. |

LOWER | LOWER(expression) | Converts all characters in a string to lower case. |

REGEX_EXTRACT | REGEX_EXTRACT (string, regex, index) | Performs regular expression matching and extracts the matched group defined by an index parameter. The function uses Java regular expression form. |

REGEX_EXTRACT_ALL | REGEX_EXTRACT (string, regex) | Performs regular expression matching and extracts all matched groups. The function uses Java regular expression form. |

REPLACE | REPLACE(string, ‘oldChar’, ‘newChar’); | Replaces existing characters in a string with new characters. |

STRSPLIT | STRSPLIT(string, regex, limit) | Splits a string around matches of a given regular expression. |

SUBSTRING | SUBSTRING(string, startIndex, stopIndex) | Returns a substring from a given string. |

TRIM | TRIM(expression) | Returns a copy of a string with leading and trailing white space removed. |

UCFIRST | UCFIRST(expression) | Returns a string with the first character converted to upper case. |

UPPER | UPPER(expression) | Returns a string converted to upper case. |

# Tuple, Bag, Map Functions

Operator | Description | Example |
---|---|---|

TOTUPLE | TOTUPLE(expression [, expression ...]) | Converts one or more expressions to type tuple. |

TOBAG | TOBAG(expression [, expression ...]) | Converts one or more expressions to individual tuples which are then placed in a bag. |

TOMAP | TOMAP(key-expression, value-expression [, key-expression, value-expression ...]) | Converts key/value expression pairs into a map. Needs an even number of expressions as parameters. The elements must comply with map type rules. |

TOP | TOP(topN,column,relation) | Returns the top-n tuples from a bag of tuples. |

# User Defined Functions (UDFs)

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript and Ruby.

**Registering UDFs**

Registering Java UDFs:

---register_java_udf.pig register 'your_path_to_piggybank/piggybank.jar'; divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);

Registering Python UDFs (The Python script must be in your current directory):

--register_python_udf.pig register 'production.py' using jython as bballudfs; players = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);

**Writing UDFs**

Java UDFs:

package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try{ String str = (String)input.get(0); return str.toUpperCase(); }catch(Exception e){ throw new IOException("Caught exception processing input row ", e); } } }

Python UDFs

#Square - Square of a number of any data type @outputSchemaFunction("squareSchema") -- Defines a script delegate function that defines schema for this function depending upon the input type. def square(num): return ((num)*(num)) @schemaFunction("squareSchema") --Defines delegate function and is not registered to Pig. def squareSchema(input): return input #Percent- Percentage @outputSchema("percent:double") --Defines schema for a script UDF in a format that Pig understands and is able to parse def percent(num, total): return num * 100 / total

# Data Types

### Simple Types

Operator | Description | Example |
---|---|---|

int | Signed 32-bit integer | 10 |

long | Signed 64-bit integer | Data: 10L or 10l Display: 10L |

float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F Display: 10.5F or 1050.0F |

double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2 Display: 10.5 or 1050.0 |

chararray | Character array (string) in Unicode UTF-8 format | hello world |

bytearray | Byte array (blob) | |

boolean | boolean | true/false (case insensitive) |

# Complex Types

Operator | Description | Example |
---|---|---|

tuple | An ordered set of fields. | (19,2) |

bag | An collection of tuples. | {(19,2), (18,1)} |

map | A set of key value pairs. | [name#John,phone#5551212] |