Pig and Hive basic use
Q1: Basic Pig Operations
(a) Install Pig in the Hortonworks cluster.
Log in to Ambari and stop the service.
Click Add Service in Actions.
Choose Pig in the pop-up window and install it.
(b) Upload files to HDFS [1]
Select Files from the Off-canvas menu at the top.
Navigate to /user/admin and click the Upload button.
Click the Browse button to open a dialog box, select the file, and click Open.
Upload the two files and check that two new files appear in the directory.
(c) Pig Script and Output
Click Pig View in the Off-canvas menu.
Click the + New Script button at the top right, fill in a name, then write the code.
Run the script by clicking the Execute button at the top right of the composition area.
Go back to Files; the output is saved in /user/admin/output/.
Pig Script [2]
Output
Q2: Basic Hive Operations
(a) Install Hive in the Hortonworks cluster.
(b) Hive Script and Output [3]
Create table1 and table2 and load data from HDFS:

    -- Tab-separated 1-gram files: token, year, match count, volume count.
    create table table1(bigram string, year int, match_count int, volume_count int)
    row format delimited
    fields terminated by '\t'
    stored as textfile;

    -- Note: load data inpath moves the file from its HDFS location into the table.
    load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-a' overwrite into table table1;

    create table table2(bigram string, year int, match_count int, volume_count int)
    row format delimited
    fields terminated by '\t'
    stored as textfile;

    load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-b' overwrite into table table2;

Union the two tables and save the result into combined [4]:

    create table combined as
    select unioned.bigram,unioned.year,unioned.match_count
    from (
    select a.bigram,a.year,a.match_count
    from table1 a
    union all
    select b.bigram,b.year,b.match_count
    from table2 b
    ) unioned;

Sum the number of occurrences of each bigram per year and save the result into groups [5]:

    create table groups as
    select bigram,year,sum(match_count) as match_count
    from combined
    group by bigram, year;

Compute the average number of occurrences per year for each bigram and save the result into averages:

    create table averages as
    select bigram,avg(match_count) as avg_count
    from groups
    group by bigram;

Create a table tops to store the top 20 bigrams [6]:

    create table tops as
    select * from averages order by avg_count desc limit 20;

Default database at the end
Output
    select * from tops;
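The intermediate tables combined, groups and averages are convenient for inspecting each step, but the same result can probably be produced in one statement. The following is only a sketch of that alternative, not the script that was timed here; the subquery alias per_year is a name introduced for illustration.

```sql
-- Sketch: top 20 bigrams by average yearly count, without intermediate tables.
-- Assumes table1 and table2 already exist and are loaded as described above.
select bigram, avg(match_count) as avg_count
from (
  -- Total occurrences of each bigram in each year, across both input tables.
  select bigram, year, sum(match_count) as match_count
  from (
    select bigram, year, match_count from table1
    union all
    select bigram, year, match_count from table2
  ) unioned
  group by bigram, year
) per_year
group by bigram
order by avg_count desc
limit 20;
```

Keeping the separate tables, as the script above does, makes it easier to check each stage and to compare it with the corresponding step in Pig.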
(c) Compare the performance with Pig
Pig overall run time: 30 min, 3 sec
Hive overall run time: 11 min, 59 sec
From these times, Hive is much faster than Pig. However, when I checked the applications on Hadoop, I found a problem: the Pig Latin script ran as MapReduce jobs, while Hive ran as Tez jobs. That might be the reason why Hive is so much faster than Pig.
So I switched Pig to Tez and re-ran the script.
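The comparison can also be evened out from the Hive side, by running Hive on MapReduce instead of moving Pig to Tez. This is a minimal sketch of that alternative and was not part of the timings above; it assumes the session is allowed to change hive.execution.engine.

```sql
-- Print the execution engine the current session uses (for example mr or tez).
set hive.execution.engine;

-- Assumption: switch this session to MapReduce so Hive and Pig use the same engine.
set hive.execution.engine=mr;

-- Rebuild the final table and time it on the chosen engine.
drop table if exists tops;
create table tops as
select * from averages order by avg_count desc limit 20;
```

Whichever direction is chosen, both tools should run on the same engine before their run times are compared.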