HIVE VS PIG

Feature                        | Hive              | Pig
-------------------------------|-------------------|----------------
Language                       | SQL-like          | Pig Latin
Schemas/Types                  | Yes (explicit)    | Yes (implicit)
Partitions                     | Yes               | No
Server                         | Optional (Thrift) | No
User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)
Custom Serializer/Deserializer | Yes               | Yes
DFS Direct Access              | Yes (implicit)    | Yes (explicit)
Join/Order/Sort                | Yes               | Yes
Shell                          | Yes               | Yes
Streaming                      | Yes               | Yes
Web Interface                  | Yes               | No
JDBC/ODBC                      | Yes (limited)     | No
Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than scripting the operation of several MapReduce jobs step by step. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it is not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it lets them easily leverage a Hadoop cluster to perform ad-hoc queries or generate report data over data stored in HDFS and other Hadoop-backed storage systems.
WORD COUNT EXAMPLE - PIG SCRIPT
Q) How do you find the number of occurrences of each word in a file using a Pig script?
You can find the famous word count example written as a MapReduce program on the Apache website. Here we will write a simple Pig script for the word count problem.
The following Pig script finds the number of times each word is repeated in a file:
Word Count Example Using Pig Script:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
The above Pig script first splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator then converts that bag into individual tuples, one per word. The third statement groups identical words together so that the count can be computed, which is done in the fourth statement.
As you can see, with just five lines of Pig we have solved the word count problem.
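Instead of printing the result with DUMP, you can also sort the counts and write them back to HDFS. A minimal sketch, assuming the wordcount relation from the script above and a hypothetical output path:

ordered = ORDER wordcount BY $1 DESC; -- sort words by their count, highest first
STORE ordered INTO '/user/hadoop/wordcount_out' USING PigStorage(','); -- write comma-separated output to the given HDFS directory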
HOW TO FILTER RECORDS - PIG TUTORIAL EXAMPLES
Pig allows you to remove unwanted records based on a condition. The FILTER operator is similar to the WHERE clause in SQL: it removes unwanted records from the data file. The syntax of the FILTER operator is shown below:
<new relation> = FILTER <relation> BY <condition>
Here relation is the data set on which the filter is applied, condition is the filter condition and new relation is the relation created after filtering the rows.
Pig Filter Examples:
Let's consider the sales data set below as an example:
year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500
2002, iphone, 2000
2000, nokia, 1200
2001, nokia, 1500
2002, nokia, 900
1. Select products whose quantity is greater than or equal to 1000:
grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int,product:chararray,quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)
2. Select products whose quantity is greater than 1000 and year is 2001:
grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)
3. Select products with year not in 2000:
grunt> D = FILTER A BY year != 2000;
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)
You can use all the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=, >=, <=) in filter conditions, as in the sketch below.
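For instance, NOT and OR can be combined in a single condition. A sketch against the same relation A loaded above:

grunt> E = FILTER A BY NOT (product == 'iphone') OR quantity >= 1500; -- keep non-iphone rows, plus any row with quantity 1500 or more
grunt> DUMP E;
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)
(2002,nokia,900)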
CREATING SCHEMA, READING AND WRITING DATA - PIG TUTORIAL
The first step in processing a data set with Pig is to define a schema for it. A schema is a representation of the data set in terms of fields. Let's see how to define a schema with an example.
Consider the following products data set in Hadoop as an example:
10, iphone, 1000
20, samsung, 2000
30, nokia, 3000
Here the first field is the product id, the second field is the product name and the third field is the product price.
Defining Schema:
The LOAD operator is used to define a schema for a data set. Let's see different usages of the LOAD operator for defining a schema for the above data set.
1. Creating Schema without specifying any fields.
In this method, we don't specify any field names when creating the schema. An example is shown below:
grunt> A = LOAD '/user/hadoop/products';
Pig is a data flow language. Each operational statement in Pig consists of a relation and an operation: the left side of the statement is the relation and the right side is the operation. Pig statements must be terminated with a semicolon. Here A is a relation, and /user/hadoop/products is a file in HDFS.
To view the schema of a relation, use the DESCRIBE statement as shown below:
grunt> describe A;
Schema for A unknown.
As no fields are defined, the above DESCRIBE statement on A shows "Schema for A unknown". To display the contents on the console, use the DUMP operator.
grunt> DUMP A;
(10,iphone,1000)
(20,samsung,2000)
(30,nokia,3000)
To write the data set into HDFS, use the STORE operator as shown below:
grunt> STORE A INTO '<hdfs directory name>';
2. Defining schema without specifying any data types.
We can create a schema just by specifying the field names without any data types. An example is shown below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id, product_name, price);
grunt> describe A;
A: {id: bytearray,product_name: bytearray,price: bytearray}
grunt> STORE A INTO '/user/hadoop/products_out' USING PigStorage('|'); -- writes data with pipe as the delimiter into a new HDFS directory
PigStorage is used to specify the field delimiter. The default field delimiter is the tab character, so if your data is tab-separated you can omit the USING PigStorage clause. In the STORE operation, you can also use PigStorage to specify the output separator.
You have to specify the field names in the AS clause. As we didn't specify any data types, Pig assigned the default data type, bytearray, to each field.
3. Defining schema with field names and data types.
To specify the data type, use a colon after the field name. Take a look at the example below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> describe A;
A: {id: int,product_name: chararray,price: int}
Accessing the Fields:
So far, we have seen how to define a schema, how to print the contents of a relation on the console and how to write data to HDFS. Now we will see how to access the fields.
The fields can be accessed in two ways:
- Field Names: We can specify the field name to access the values from that particular field.
- Positional Parameters: Field positions start from 0. $0 refers to the first field, $1 to the second, and so on.
Example:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> B = FOREACH A GENERATE id;
grunt> C = FOREACH A GENERATE $1,$2;
grunt> DUMP B;
(10)
(20)
(30)
grunt> DUMP C;
(iphone,1000)
(samsung,2000)
(nokia,3000)
FOREACH is like a for loop used to iterate over the records of a relation. The GENERATE keyword specifies what operation to perform on each record. In the above example, GENERATE is used to project fields from relation A.
Note: It is always good practice to check the schema of a relation with the DESCRIBE statement before performing an operation. Knowing the schema tells you how to access its fields.
PIG DATA TYPES - PRIMITIVE AND COMPLEX
Pig has a very limited set of data types, classified into two groups:
- Primitive
- Complex
Primitive Data Types: The primitive data types are also called simple data types. The simple types that Pig supports are listed below (see the sketch after this list):
- int : a signed 32-bit integer, similar to Java's Integer.
- long : a signed 64-bit integer, similar to Java's Long.
- float : a 32-bit floating-point number, similar to Java's Float.
- double : a 64-bit floating-point number, similar to Java's Double.
- chararray : a character array in Unicode UTF-8 format; corresponds to Java's String.
- bytearray : used to represent raw bytes. It is the default data type: if you don't specify a data type for a field, bytearray is assigned.
- boolean : represents true/false values.
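As a quick illustration, here is a LOAD that declares one field of each simple type. This is a sketch: the file path and field names are hypothetical, and declaring a boolean field assumes Pig 0.10 or later, where boolean is a first-class type.

grunt> A = LOAD '/user/hadoop/sample_data' USING PigStorage(',') AS (id:int, views:long, rating:float, price:double, name:chararray, raw:bytearray, active:boolean);
grunt> describe A;
A: {id: int,views: long,rating: float,price: double,name: chararray,raw: bytearray,active: boolean}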
Complex Types: Pig supports three complex data types. They are listed below:
- Tuple : an ordered set of fields. A tuple is enclosed in parentheses. Example: (1,2)
- Bag : a collection of tuples. A bag is enclosed in curly braces. Example: {(1,2),(3,4)}
- Map : a set of key-value pairs. A map is enclosed in square brackets, and # separates key and value. Example: [key#value]
Pig allows nesting of complex data structures: you can nest a tuple inside a tuple, a bag or a map, as in the sketch below.
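For example, a record can carry a bag of tuples and a map alongside its scalar fields. A sketch with a hypothetical tab-separated file whose bag and map fields use Pig's {} and [] text representations:

grunt> A = LOAD '/user/hadoop/student_data' AS (name:chararray, marks:bag{t:tuple(subject:chararray, score:int)}, info:map[]);
grunt> B = FOREACH A GENERATE name, FLATTEN(marks); -- one (name, subject, score) tuple per bag element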
Null: Null is not a data type; it is an undefined or corrupted value. For example, suppose you declared a field as int but the field actually contains character values. When reading data from this field, Pig converts those corrupted character values into nulls. Any operation on a null results in null. Null in Pig is similar to NULL in SQL.
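Pig provides the is null and is not null operators for handling such values. A minimal sketch, assuming the sales relation A from the filter examples with some unparseable quantity values:

grunt> clean = FILTER A BY quantity IS NOT NULL; -- drop records whose quantity could not be cast to int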
RELATIONS, BAGS, TUPLES, FIELDS - PIG TUTORIAL
In this article, we will see what a relation, a bag, a tuple and a field are. Let's see each of these in detail.
Let's consider the following products data set as an example:
Id, product_name
----------------
10, iphone
20, samsung
30, Nokia
- Field: A field is a piece of data. In the above data set product_name is a field.
- Tuple: A tuple is a set of fields. Here Id and product_name together form a tuple. Tuples are enclosed in parentheses. Example: (10, iphone).
- Bag: A bag is a collection of tuples. A bag is enclosed in curly braces. Example: {(10,iphone),(20, samsung),(30,Nokia)}.
- Relation: A relation represents the complete data set. A relation is a bag; to be precise, an outer bag. We can call a relation a bag of tuples.
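To see these pieces concretely, you can load and dump the data set: the full output is the relation (an outer bag), each parenthesized row is a tuple, and each comma-separated value is a field. A small sketch, assuming the products file above is stored at a hypothetical HDFS path:

grunt> products = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray);
grunt> DUMP products;
(10,iphone)
(20,samsung)
(30,Nokia)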
HOW TO RUN PIG PROGRAMS - EXAMPLES
Pig programs can be run in three ways, all of which work in both local and MapReduce mode. They are:
- Script Mode
- Grunt Mode
- Embedded Mode
Script Mode or Batch Mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file:
> cat scriptfile.pig
A = LOAD 'script_file';
DUMP A;
> pig scriptfile.pig
(pig script mode example)
(pig runs on top of hadoop)
Grunt Mode or Interactive Mode: Grunt is Pig's interactive shell; it starts when no file is specified for Pig to run.
> pig
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;
(pig grunt or interactive mode example)
(pig runs on top of hadoop)
You can also run Pig scripts from Grunt using the run and exec commands: run executes the script in the current Grunt session, so aliases defined in the script remain available in the shell, while exec runs the script as a separate batch job.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded Mode: You can embed Pig programs in Java and run them from Java, typically through Pig's PigServer class.