Basic Pig and Hive Operations

Q1: Basic Pig Operations

(a) Install Pig on the Hortonworks cluster.

  1. Log in to Ambari and stop the running services.

  2. Click Add Service under the Actions menu.

  3. Choose Pig in the pop-up window and install it.

(b) Upload files to HDFS [1]

  1. Select the Files View from the off-canvas menu at the top.

  2. Navigate to /user/admin and click the Upload button.

  3. Click the Browse button to open a dialog box, select the file, and click Open.

  4. Upload both files and verify that two new files appear in the directory.

(c) Pig Script and Output

  1. Click on the Pig View from the off-canvas menu.

  2. Click the + New Script button at the top right, fill in a name, and write the code.

  3. Run the script by clicking the Execute button at the top right of the composition area.

  4. Return to the Files View; the output is saved in /user/admin/output/.

Pig Script [2]
Table1 = LOAD 'googlebooks-eng-all-1gram-20120701-a' AS (bigram:chararray, year:int, match_count:double, volume_count:int);
Table2 = LOAD 'googlebooks-eng-all-1gram-20120701-b' AS (bigram:chararray, year:int, match_count:double, volume_count:int);
Datas = UNION Table1, Table2;
Groups = GROUP Datas BY bigram;
Averages = FOREACH Groups GENERATE group, AVG(Datas.match_count) AS average;
Results = ORDER Averages BY average DESC;
Top20 = LIMIT Results 20;
DUMP Top20;
STORE Top20 INTO '/user/admin/output' USING PigStorage('\t');
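To make the dataflow of the script concrete, the following Python sketch mirrors its logic (UNION, GROUP BY bigram, AVG, ORDER DESC, LIMIT 20) on a few made-up rows; the tuple layout follows the schema above, but the values are toy data, not the real 1-gram files.

```python
from collections import defaultdict

# Toy rows mimicking the 1-gram layout: (bigram, year, match_count, volume_count).
# Values are invented for illustration; the real input is the Google Books files.
table1 = [("apple", 2000, 10.0, 3), ("apple", 2001, 30.0, 4), ("art", 2000, 5.0, 1)]
table2 = [("bee", 2000, 50.0, 2), ("bee", 2001, 70.0, 2)]

datas = table1 + table2                      # Datas = UNION Table1, Table2
groups = defaultdict(list)                   # Groups = GROUP Datas BY bigram
for bigram, year, match_count, volume_count in datas:
    groups[bigram].append(match_count)

# Averages = FOREACH ... GENERATE group, AVG(...); then ORDER ... DESC; LIMIT 20
averages = {b: sum(c) / len(c) for b, c in groups.items()}
top20 = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:20]
print(top20)   # [('bee', 60.0), ('apple', 20.0), ('art', 5.0)]
```

Note that the Pig script averages the raw match_count rows per bigram, without first aggregating by year.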
Output









Q2: Basic Hive Operations

(a) Install Hive on the Hortonworks cluster.

(b) Hive Script and Output [3]

  1. Create table1 and table2, and load the data from HDFS
    create table table1(bigram string, year int, match_count int, volume_count int)
    row format delimited
    fields terminated by '\t'
    stored as textfile;
    load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-a' overwrite into table table1;
    create table table2(bigram string, year int, match_count int, volume_count int)
    row format delimited
    fields terminated by '\t'
    stored as textfile;
    load data inpath '/user/admin/googlebooks-eng-all-1gram-20120701-b' overwrite into table table2;
  2. Union the two tables and save the result into combined [4]
    create table combined as
    select unioned.bigram,unioned.year,unioned.match_count
    from(
    select a.bigram,a.year,a.match_count
    from table1 a
    union all
    select b.bigram,b.year,b.match_count
    from table2 b
    ) unioned;
  3. Sum the number of occurrences per year and save the result into groups [5]
    create table groups as
    select bigram,year,sum(match_count) as match_count
    from combined
    group by bigram, year;
  4. Compute the average number of occurrences per year for each bigram and save the result into averages
    create table averages as
    select bigram,avg(match_count) as avg_count
    from groups
    group by bigram;
  5. Create a table tops to store the top 20 bigrams [6]
    create table tops as
    select * from averages order by avg_count desc limit 20;

    The default database at the end

  6. Output
    select * from tops;
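The Hive pipeline aggregates in two stages: step 3 sums match_count per (bigram, year), and step 4 averages those per-year sums for each bigram. The following Python sketch illustrates this two-stage aggregation on made-up rows (the values are toy data, not the real 1-gram files):

```python
from collections import defaultdict

# Toy rows (bigram, year, match_count); values are invented for illustration.
combined = [("apple", 2000, 10), ("apple", 2000, 20), ("apple", 2001, 30)]

# Step 3: sum occurrences per (bigram, year) -> the "groups" table
per_year = defaultdict(int)
for bigram, year, match_count in combined:
    per_year[(bigram, year)] += match_count   # apple/2000 -> 30, apple/2001 -> 30

# Step 4: average the per-year sums for each bigram -> the "averages" table
by_bigram = defaultdict(list)
for (bigram, year), total in per_year.items():
    by_bigram[bigram].append(total)
averages = {b: sum(t) / len(t) for b, t in by_bigram.items()}
print(averages)   # {'apple': 30.0}
```

This differs slightly from the Pig script, which averages the raw rows directly (20.0 in this example), so the two pipelines can disagree whenever a bigram has several rows for the same year.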

(c) Compare the performance with Pig

Pig overall run-time: 30 min 3 sec

Hive overall run-time: 11 min 59 sec


From these timings, Hive took much less time than Pig. But I found a clue when checking the applications on the Hadoop cluster: the Pig Latin script ran as MapReduce jobs, while the Hive queries ran as Tez jobs. That is likely why Hive appeared so much faster than Pig.
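For reference, the measured run-times above work out to roughly a 2.5x advantage for Hive:

```python
pig_seconds = 30 * 60 + 3       # 30 min 3 sec
hive_seconds = 11 * 60 + 59     # 11 min 59 sec
print(round(pig_seconds / hive_seconds, 2))   # 2.51
```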

So I switched Pig to the Tez execution engine (for example, `pig -x tez script.pig` on the command line) and re-ran the script.