Tuesday, February 18, 2014

Pentaho Big Data : Pig Script Executor

Apache Pig is a high-level data analysis language capable of handling very large data volumes. Ease of programming, built-in parallelization, extensibility, and optimization opportunities are some of the key features of the platform.

The "Pig Script Executor" job entry can be used to execute a Pig Latin script on a Hadoop cluster.
A step-by-step illustration of how to configure the PDI "Pig Script Executor" is given below.

Hortonworks Sandbox version 2.0 is used for this demo. Refer to the link below for more information.
http://hortonworks.com/products/hortonworks-sandbox/#overview
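
Before configuring the job entry, it helps to confirm that HDFS on the sandbox is reachable. A minimal check (assuming the sandbox VM answers at 192.168.154.131, the address used throughout this demo):

[root@sandbox ~]# hadoop fs -ls hdfs://192.168.154.131:8020/user/hue/transactions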

Step 1

Open Spoon and create a new job.
Drag the "Pig Script Executor" step into the canvas.


Step 2

Enter the HDFS hostname and port.
Similarly, provide the Job Tracker hostname and port.
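
For reference, the values used in this demo point at the sandbox VM (they can also be seen in the execution log further below):

HDFS hostname: 192.168.154.131, port: 8020
Job Tracker hostname: 192.168.154.131, port: 8021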


Step 3

Enter the name and location of the Pig script, or select it using the "Browse" option.
Check "Enable Blocking" if downstream job entries need to wait for the Pig script to complete successfully. The "Local execution" option can be enabled for local testing.


Pig Script ( sum_trans_pig_script.pig )

-- Load the raw transaction CSV; no schema is declared, so fields are referenced by position
trans = LOAD '/user/hue/transactions/store_transactions.csv' USING PigStorage(',');
-- $2 = store name, $4 = transaction amount (cast explicitly for SUM)
trans_line = FOREACH trans GENERATE TRIM($2) AS storeName, (double)$4 AS transAmt;
-- Group the rows by store
grp_trans = GROUP trans_line BY storeName;
-- Total per store, rounded to two decimal places via ROUND(100f * x) / 100f
sum_amt = FOREACH grp_trans GENERATE group AS grp, (ROUND(100f * SUM(trans_line.transAmt))) / 100f AS sum_amt;
-- Write the per-store totals back to HDFS as CSV
STORE sum_amt INTO '/user/hue/transactions/trans_out' USING PigStorage(',');
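
Independently of PDI, the script can also be validated from the Pig command line before wiring it into the job. A minimal sketch, assuming the Pig client is installed on the sandbox and the script has been copied there (note that in local mode the input/output paths resolve against the local filesystem, so they would need adjusting):

[root@sandbox ~]# pig -x local sum_trans_pig_script.pig      # local mode, quick testing
[root@sandbox ~]# pig -x mapreduce sum_trans_pig_script.pig  # run on the cluster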


Input CSV Data ( Sample )

[root@sandbox ~]# hadoop fs -cat /user/hue/transactions/store_transactions.csv|head -10
10001,2012-01-01 09:00:00.0,Fort Worth,Women's Clothing,153.57,Visa
10002,2012-01-01 09:00:00.0,San Diego,Music,66.08,Cash
10003,2012-01-01 09:00:00.0,Pittsburgh,Pet Supplies,493.51,Discover
10004,2012-01-01 09:00:00.0,Omaha,Children's Clothing,235.63,MasterCard
10005,2012-01-01 09:00:00.0,Stockton,Men's Clothing,247.18,MasterCard
10006,2012-01-01 09:00:00.0,Austin,Cameras,379.6,Visa
10007,2012-01-01 09:00:00.0,New York,Consumer Electronics,296.8,Cash
10008,2012-01-01 09:00:00.0,Corpus Christi,Toys,25.38,Discover
10009,2012-01-01 09:00:00.0,Fort Worth,Toys,213.88,Visa
10010,2012-01-01 09:00:00.0,Las Vegas,Video Games,53.26,Visa
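
To sanity-check the job's result, one store's total can be computed straight from the CSV with standard shell tools. A rough sketch (in awk's 1-based numbering, field 3 is the store name and field 5 the amount; if the store field carried stray whitespace, it would need trimming first, just as the script's TRIM($2) does):

[root@sandbox ~]# hadoop fs -cat /user/hue/transactions/store_transactions.csv | awk -F',' '$3 == "Omaha" {s += $5} END {printf "%.2f\n", s}'

The result should match the Omaha line in the output sample below.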



Step 4

Execution Results


Output CSV Data ( Sample )

[root@sandbox ~]# hadoop fs -cat /user/hue/transactions/trans_out/part-r-00000
Mesa,27046.94
Reno,23507.56
Boise,24273.33
Miami,25034.11
Omaha,28500.32
Plano,26486.6
Tampa,22132.69
Tulsa,18669.55
Aurora,28181.13
Austin,25597.62
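
A quick way to list the stores with the highest totals from the job output (a small sketch using standard shell tools; field 2 is the per-store sum):

[root@sandbox ~]# hadoop fs -cat /user/hue/transactions/trans_out/part-r-00000 | sort -t',' -k2 -nr | head -5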

Execution Log

2014/02/18 16:13:11 - Spoon - Asking for repository
2014/02/18 16:13:11 - Version checker - OK
2014/02/18 16:13:17 - Spoon - Starting job...
2014/02/18 16:13:17 - job_demo_pig - Start of job execution
2014/02/18 16:13:17 - job_demo_pig - Starting entry [Pig Script Executor]
2014/02/18 16:13:17 - Pig Script Executor - 2014/02/18 16:13:17 - Connecting to hadoop file system at: hdfs://192.168.154.131:8020
2014/02/18 16:13:18 - Pig Script Executor - 2014/02/18 16:13:18 - Connecting to map-reduce job tracker at: 192.168.154.131:8021
2014/02/18 16:13:19 - Pig Script Executor - 2014/02/18 16:13:19 - Setting Parallelism to 1
2014/02/18 16:13:19 - Pig Script Executor - 2014/02/18 16:13:19 - creating jar file Job7705643851097701358.jar
2014/02/18 16:13:21 - Pig Script Executor - 2014/02/18 16:13:21 - jar file Job7705643851097701358.jar created
2014/02/18 16:13:21 - Pig Script Executor - 2014/02/18 16:13:21 - 1 map-reduce job(s) waiting for submission.
2014/02/18 16:13:21 - Pig Script Executor - 2014/02/18 16:13:21 - Total input paths to process : 1
2014/02/18 16:13:21 - Pig Script Executor - 2014/02/18 16:13:21 - Total input paths (combined) to process : 1
2014/02/18 16:13:22 - Pig Script Executor - 2014/02/18 16:13:22 - HadoopJobId: job_1392722352996_0029
2014/02/18 16:13:22 - Pig Script Executor - 2014/02/18 16:13:22 - Processing aliases grp_trans,sum_amt,trans,trans_line
2014/02/18 16:13:22 - Pig Script Executor - 2014/02/18 16:13:22 - detailed locations: M: trans[1,8],trans_line[2,13],sum_amt[4,10],grp_trans[3,12] C: sum_amt[4,10],grp_trans[3,12] R: sum_amt[4,10]
2014/02/18 16:13:22 - Pig Script Executor - 2014/02/18 16:13:22 - 0% complete
2014/02/18 16:13:33 - Pig Script Executor - 2014/02/18 16:13:33 - 50% complete
2014/02/18 16:13:42 - Pig Script Executor - 2014/02/18 16:13:42 - 100% complete
Input(s):
Successfully read 10001 records (621892 bytes) from: "/user/hue/transactions/store_transactions.csv"

Output(s):
Successfully stored 103 records (1912 bytes) in: "/user/hue/transactions/trans_out"

Counters:
Total records written : 103
Total bytes written : 1912
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1392722352996_0029
2014/02/18 16:13:42 - Pig Script Executor - 2014/02/18 16:13:42 - Success!
2014/02/18 16:13:42 - Pig Script Executor - Num successful jobs: 1 num failed jobs: 0
2014/02/18 16:13:42 - job_demo_pig - Finished job entry [Pig Script Executor] (result=[true])
2014/02/18 16:13:42 - job_demo_pig - Job execution finished
2014/02/18 16:13:42 - Spoon - Job has ended.
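
Once the job runs cleanly in Spoon, it can also be launched headless with PDI's Kitchen command-line tool, for example from a cron job. A minimal sketch, assuming the job was saved to a file named job_demo_pig.kjb:

sh kitchen.sh -file=/path/to/job_demo_pig.kjb -level=Basic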

1 comment:

  1. I'm trying to execute a Pig script, and my HDInsight cluster is on the Azure cloud. The script runs fine in PowerShell, but when I try to run it using the Pig Script Executor in Kettle, it fails with the error below.

    Failed to create DataStorage
    java.lang.RuntimeException: Failed to create DataStorage


    I have provided the HDFS hostname and other details.
    I'm not able to figure out the issue here. Any help is appreciated. Thanks
