Wednesday, April 25, 2012

Pig Performance Measures


Performance tuning for Pig scripts


       1. Avoid SPLIT; use FILTER

            SPLIT partitions a relation into multiple relations according to the conditions provided; the same result can usually be obtained with separate FILTER statements, one per condition.

Example : 

Input 1
=======
1 2 3 4
5 6 7 8
0 46 47 48


Example 1:

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
SPLIT A INTO X IF a > 0, Y IF a > 40;
DUMP X;
DUMP Y;

Example 2:

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
Y = FILTER A BY a > 40;
DUMP X;
DUMP Y;

DESCRIPTION :
In the above example, using FILTER instead of SPLIT improves the performance of the script. The input is not processed multiple times in a single statement, and each filtered relation can be stored as an intermediate result and reused for other processing.

2. Store Intermediate Results:
It is good to store intermediate results so that further processing can be done on top of them, instead of reprocessing the initial input again, which is time consuming.

Example:

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
L = GROUP X BY a;
Y = FOREACH L GENERATE group, SUM(X.b);
Z = FOREACH L GENERATE group, AVG(X.b);
STORE Y INTO '/home/hadoop/output/';

In the above example, instead of reprocessing A twice to generate the sum and the average of the field, it is better to reuse the grouped relation L for both computations, which saves reprocessing A for the same condition.

3. Remove Unnecessary DUMPs :
             DUMP is an interactive statement; using it disables multi-query processing and consumes more time than a STORE statement.


Example:

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
DUMP X;
L = GROUP X BY a;
Y = FOREACH L GENERATE group, SUM(X.b);
Z = FOREACH L GENERATE group, AVG(X.b);
STORE Y INTO '/home/hadoop/output/';

Description :
        In the above example, aliases A and X are processed first for the DUMP (treated as one job), and then A, X, L, Y, Z and the STORE are performed, so the same script executes as two separate jobs because of the DUMP statement. Remove all DUMP statements that were only used for debugging.

4. Avoid using User Defined Functions:
           Unless it is necessary, avoid UDFs, as they need extra processing beyond running the Pig script, which consumes time and CPU. Pig provides many built-in facilities which can be used in place of UDFs.
Built-in functions and operators like SUM, COUNT, DISTINCT, FLATTEN, GROUP BY and FILTER can be used instead of elaborating everything in a UDF, which is again a waste of time and memory.
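
For instance, a simple per-key aggregation can be written entirely with built-ins, with no jar to register and no custom code to maintain (the input path and schema here are illustrative):

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = GROUP A BY a;
-- SUM and COUNT are built-ins; no UDF registration is needed
C = FOREACH B GENERATE group, SUM(A.b), COUNT(A);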

5. Other ways to improve pig performance

a. Proper Usage of Operators:
 A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
 B = FILTER A BY NOT (NOT (a > 5) OR a > 10);
This can be simplified using the AND operator, which saves the extra burden of evaluating two negations and one OR to obtain the same result.

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
 B = FILTER A BY a > 5 AND a <= 10;

 b. Avoid unnecessary calculations

 A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
 B= FILTER A BY a > 1+2;
   Here, instead of evaluating 1+2 for each value of a, the expression can be simplified to the precomputed value 3.

B= FILTER A BY a > 3;

c. Avoid repetition of conditions in various forms

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
 B= FILTER A BY a > 0 and a > 3;

   Here, if a is greater than 3 then it is obviously greater than 0, so the first check can be avoided and the filter simplified to

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
 B= FILTER A BY  a > 3;

d. Eliminate conditions that always evaluate to TRUE, like

  B = FILTER A BY  1==1;

6. Minimize the number of operations performed

Example :

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FOREACH A GENERATE a+b AS d, a-c AS e, c;
C = FOREACH B GENERATE d-c AS f, e;
STORE C INTO '/home/hadoop/output/';

The above operation can be simplified to

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FOREACH A GENERATE a+b-c AS d, a-c AS e;
STORE B INTO '/home/hadoop/output/';

Calculating a+b, storing the result under an alias and then subtracting c from it is the same as computing a+b-c in a single expression, which saves time and memory.

7. Limit the number of output lines
          Store or dump only the number of lines you are interested in by setting a limit.

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 50;
C = LIMIT B 10;
DUMP C;

Here, instead of listing all the values of B, only 10 values are viewed by limiting the count.

8. Push up filters:
Instead of processing the entire data set, filter out only the input that is needed and do the processing on top of it.
Example :
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY b > 0;
C = GROUP B ALL;
D = FOREACH C GENERATE SUM(B.b);

In the above example, instead of computing the sum over all values of field b, only the values greater than zero are filtered first and then summed, which prevents processing of unwanted data.

9. Do Explicit Casting:
           Instead of letting Pig infer the types, declare them explicitly in the schema when the data types are known, which saves time.

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int);

As shown above, load the data with an explicit schema instead of
A = load '/home/hadoop/input/' using PigStorage('\t') as (a,b,c,d);

10. Project only the necessary fields while processing :

      A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int,e:int);
      B = FILTER A BY a > 0;
      C = FILTER B BY b > 50;
      D = FOREACH (GROUP B ALL) GENERATE COUNT(B.c);
      E = FOREACH (GROUP C ALL) GENERATE COUNT(C.d);
      F = UNION D, E;

Here, in the above example, field e is used nowhere in the processing, so it can be dropped right after loading the data into alias A, which will save time and memory and improve performance.

The above script can be modified as

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int,e:int);
A1 = FOREACH A GENERATE a, b, c, d;
B = FILTER A1 BY a > 0;
C = FILTER B BY b > 50;
D = FOREACH (GROUP B ALL) GENERATE COUNT(B.c);
E = FOREACH (GROUP C ALL) GENERATE COUNT(C.d);
F = UNION D, E;

11. UDF Considerations :

         UDFs are used to add customized functionality to a Pig script. A badly coded UDF can degrade the performance of the script. The following points should be kept in mind when creating a UDF:

A.    Place the UDF classes in a proper package.
B.    Register the jars before calling the UDFs.
C.    Don't register unwanted jars in the Pig script.
D.    Call the functions with their full package and class names.
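
A minimal sketch of points B and D (the jar path and the class name com.example.pig.TrimToUpper are hypothetical):

REGISTER /home/hadoop/udfs/myudfs.jar;
A = load '/home/hadoop/input/' using PigStorage('\t') as (name:chararray);
-- call the UDF with its full package and class name
B = FOREACH A GENERATE com.example.pig.TrimToUpper(name);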
12. Drop Null Values before calculations :
           Remove records whose fields are null before joins, sums, differences and other algebraic operations; null values have no significance in these functions and are usually ignored, so dropping them early avoids carrying them through the job.

For example, the script can be written as
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = FILTER A BY a is not null;
C = FILTER A BY b is not null;
D = JOIN B by a,C by b;

13. Use the PARALLEL clause
          The PARALLEL clause lets us set the number of reducers a job can run in parallel. By default this value is set to 1, so if it is not changed we get only one reduce task, which affects the performance of the Pig script.
For example, you can set the default parallelism in a Pig script as

SET default_parallel 10;
This sets the number of reducers that can run in parallel to 10.

14. Load only the needed input
          If you want to do processing only on the first 50 rows of a file, limit the relation to 50 rows instead of working on the entire file.
For Example
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = LIMIT A 50;

15. DISTINCT VS GROUP BY
         If you want to find the distinct values of a field, there are two approaches: GROUP BY and DISTINCT. It is always better to use DISTINCT instead of GROUP BY to boost the performance of the Pig script.

Example
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = FOREACH A GENERATE a;
C = GROUP B BY a;
Can be improved by changing it as

Example
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = FOREACH A GENERATE a;
C = DISTINCT B;

16. Filter early and often
         If possible, filter values before doing a GROUP BY. The combiner improves the performance of a script, but it can only be used when there are no operators between the GROUP BY and the FOREACH.

Example:
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
B = GROUP A BY a;
C = FOREACH B GENERATE group, SUM(A.b);
D = FILTER C BY group > 0;

Can be optimized by pushing the filter before the grouping:

A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int);
A1 = FILTER A BY a > 0;
B = GROUP A1 BY a;
C = FOREACH B GENERATE group, SUM(A1.b);

17. Keep huge data last
While doing a JOIN or COGROUP with inputs of different sizes, always keep the largest relation at the end. Pig streams the last relation in a join through the reducers instead of materializing it, so doing this improves performance and makes the script more efficient.
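
A short sketch (the input paths are illustrative; huge is assumed to be much larger than small):

small = load '/home/hadoop/lookup/' using PigStorage('\t') as (a:int, name:chararray);
huge = load '/home/hadoop/logs/' using PigStorage('\t') as (a:int, b:int);
-- the last relation in the join is streamed rather than cached,
-- so the largest input goes last
J = JOIN small BY a, huge BY a;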

