Pig and Hive basic use

Q1: Basic Pig Operations

(a) Install Pig in the Hortonworks cluster.

Log in to Ambari and stop the service.
Click Add Service under Actions.
Choose Pig in the pop-up window and install it.

(b) Upload files to HDFS [1]

Select Files from the off-canvas menu at the top.
Navigate to /user/admin and click the Upload button.
Click the Browse button to open a dialog box, select the file, and click Open.
Upload both files and confirm that two new files appear in the directory.

(c) Pig Script and Output

Click Pig View in the off-canvas menu.
Click the + New Script button at the top right, fill in a name, then write the code.
Run the script by clicking the Execute button at the top right of the composition area.
Go back to Files; the output is saved in /user/admin/output/.

Pig Script [2]

Table1 = LOAD 'googlebooks-eng-all-1gram-20120701-a' AS (bigram:chararray, year:int, match_count:double, volume_count:int);
Table2 = LOAD 'googlebooks-eng-all-1gram-20120701-b' AS (bigram:chararray, year:int, match_count:double, volume_count:int);
Datas = UNION Table1, Table2;
Groups = GROUP Datas BY bigram;
Averages = FOREACH Groups GENERATE group, AVG(Datas.match_count) AS average;
Results = ORDER Averages BY average DESC;
Top20 = LIMIT Results 20;
DUMP Top20;
STORE Top20 INTO '/user/admin/output' USING PigStorage('\t');
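
As a sanity check on what the script computes, the same union, group, average, and top-N steps can be sketched in plain Python on a few made-up rows. The sample rows and the function name `top_averages` are hypothetical and only illustrate the logic; they are not part of the assignment data.

```python
from collections import defaultdict

def top_averages(rows, n=20):
    """Mirror GROUP BY bigram -> AVG(match_count) -> ORDER DESC -> LIMIT n."""
    totals = defaultdict(lambda: [0.0, 0])  # bigram -> [sum, row count]
    for bigram, year, match_count, volume_count in rows:
        totals[bigram][0] += match_count
        totals[bigram][1] += 1
    averages = [(b, s / c) for b, (s, c) in totals.items()]
    averages.sort(key=lambda pair: pair[1], reverse=True)
    return averages[:n]

# Made-up sample rows standing in for the two input files (assumption).
table1 = [("and", 2000, 100.0, 10), ("and", 2001, 300.0, 12)]
table2 = [("but", 2000, 50.0, 5), ("but", 2001, 70.0, 6)]

datas = table1 + table2            # plays the role of UNION
result = top_averages(datas, n=20)
print(result)                      # [('and', 200.0), ('but', 60.0)]
```

Each input row is one (bigram, year) record, so averaging `match_count` over the rows of a group gives the average number of occurrences per year.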

Output

Q2: Basic Hive Operations

(a) Install Hive in the Hortonworks cluster.

(b) Hive Script and Output [3]

Create `table1` and `table2` and load data from HDFS

create table table1(bigram string, year int, match_count int, volume_count int)
	row format delimited
	fields terminated by '\t'
	stored as textfile;
load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-a' overwrite into table table1;
create table table2(bigram string, year int, match_count int, volume_count int)
	row format delimited
	fields terminated by '\t'
	stored as textfile;
load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-b' overwrite into table table2;

Union the two tables and save into `combined` [4]

create table combined as
	select unioned.bigram,unioned.year,unioned.match_count
	from(
	select a.bigram,a.year,a.match_count
	from table1 a
	union all
	select b.bigram,b.year,b.match_count
	from table2 b
	) unioned;

Sum the number of occurrences per year and save into `groups` [5]

create table groups as
	select bigram,year,sum(match_count) as match_count
	from combined
	group by bigram, year;

Compute the average number of occurrences per year and save into `averages`

create table averages as
	select bigram,avg(match_count) as avg_count
	from groups
	group by bigram;
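
The Hive steps sum `match_count` per (bigram, year) before averaging, while the Pig script averages the rows directly. A minimal Python sketch, using made-up rows and hypothetical function names, suggests the two approaches agree when each (bigram, year) pair appears only once and differ when a year is repeated:

```python
from collections import defaultdict

def avg_of_rows(rows):
    """Pig-style: average match_count over all rows of a bigram."""
    acc = defaultdict(list)
    for bigram, year, count in rows:
        acc[bigram].append(count)
    return {b: sum(v) / len(v) for b, v in acc.items()}

def avg_of_yearly_sums(rows):
    """Hive-style: sum per (bigram, year), then average the yearly sums."""
    yearly = defaultdict(float)
    for bigram, year, count in rows:
        yearly[(bigram, year)] += count
    per_bigram = defaultdict(list)
    for (bigram, year), total in yearly.items():
        per_bigram[bigram].append(total)
    return {b: sum(v) / len(v) for b, v in per_bigram.items()}

# Made-up rows (assumption): one with distinct years, one with a repeated year.
unique_years = [("and", 2000, 100), ("and", 2001, 300)]
repeated_year = [("and", 2000, 100), ("and", 2000, 300)]

print(avg_of_rows(unique_years), avg_of_yearly_sums(unique_years))
print(avg_of_rows(repeated_year), avg_of_yearly_sums(repeated_year))
```

With distinct years both functions return 200.0 for "and"; with the repeated year the row average stays 200.0 while the average of yearly sums becomes 400.0.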

Create a table `tops` to store top 20 bigrams [6]

create table tops as select * from averages order by avg_count desc limit 20;

Default database at the end

Output
select * from tops;

(c) Compare the performance with Pig

Pig overall run-time: 30min, 3sec

Hive overall run-time: 11min, 59sec
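
The two run-times reported above can be compared as a ratio. The short sketch below only does the arithmetic on those two figures; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds run-time into total seconds."""
    return minutes * 60 + seconds

pig_seconds = to_seconds(30, 3)      # 1803 seconds
hive_seconds = to_seconds(11, 59)    # 719 seconds
ratio = pig_seconds / hive_seconds

print(f"Pig took {ratio:.2f}x as long as Hive")  # Pig took 2.51x as long as Hive
```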

From these times, Hive takes much less time than Pig. However, when checking the applications on Hadoop I noticed a difference: the Pig Latin script runs as MapReduce jobs, while Hive runs as Tez jobs. That might be why Hive is so much faster than Pig.

So I switched Pig to execute on Tez and re-ran the script.
