Performance Tuning for Pig Scripts
1. Avoid SPLIT; use FILTER instead
A FILTER is a type of implicit split: it selects records from the input according to
the condition provided in the filter expression.
Example :
Input 1
=======
1 2 3 4
5 6 7 8
0 46 47 48
Example 1:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
SPLIT A INTO X IF a > 0, Y IF a > 40;
DUMP X;
Example 2:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
Y = FILTER A BY a > 40;
DUMP X;
DUMP Y;
DESCRIPTION:
In the example above, using FILTER instead of SPLIT improves the performance of
the Pig script. It avoids evaluating multiple conditions in a single statement,
and each filtered relation can be stored as an intermediate output
and reused for other processing.
2. Store Intermediate Results:
It is good to store intermediate
results so that further processing can be done on top of them instead of
reprocessing the initial input, which is time consuming.
Example:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
L = GROUP X BY a;
Y = FOREACH L GENERATE group, SUM(X.b);
Z = FOREACH L GENERATE group, MIN(X.b);
STORE Y INTO '/home/hadoop/output/';
In the example above, instead of
reprocessing A to generate each aggregate of the field, it is better to store
the result of L and do further processing on it. This saves reprocessing A
again for the same condition.
3. Remove Unnecessary DUMPs:
DUMP is an interactive statement; using it disables
multi-query processing and consumes more time than a STORE statement.
Example:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
X = FILTER A BY a > 0;
DUMP X;
L = GROUP X BY a;
Y = FOREACH L GENERATE group, SUM(X.b);
Z = FOREACH L GENERATE group, MIN(X.b);
STORE Y INTO '/home/hadoop/output/';
Description:
In the example above, aliases A and X are processed first for the DUMP (which
is treated as one job), and then A, X, L, Y, Z and the STORE are executed, so
the same script runs as two separate jobs because of the DUMP statement. So
remove all DUMP statements that were only used for debugging.
4. Avoid Using Unnecessary User Defined Functions:
Unless it is necessary, avoid UDFs,
as they need extra processing beyond
running the Pig script, which consumes time and CPU. Pig provides many
built-in facilities that can be used in place of UDFs:
built-in functions such as SUM,
COUNT, DISTINCT and FLATTEN, together with operators such as GROUP BY and
FILTER, can be used instead of elaborating
everything in a UDF, which again wastes time and memory.
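As a sketch (the paths and field names are illustrative), a hand-written summation UDF can usually be replaced with the built-in SUM:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
G = GROUP A BY a;
-- the built-in SUM replaces a custom aggregation UDF
S = FOREACH G GENERATE group, SUM(A.b);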
5. Other ways to improve pig performance
a. Proper Usage of Operators:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY NOT (NOT(a > 5) OR a > 10);
This can be simplified using the AND
operator, which saves the extra burden of doing two negations and one OR to
obtain the same result.
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 5 AND a <= 10;
b. Avoid unnecessary
calculations
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 1+2;
Here, instead of computing the 1+2 operation for each value of
a, the condition can be simplified with the precomputed value 3.
B = FILTER A BY a > 3;
c. Avoid repetition of conditions in various forms
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 0 AND a > 3;
Here, if a is greater than 3 then it is obviously also
greater than 0, so the redundant check can be removed and the statement simplified to
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 3;
d. Eliminate conditions that always evaluate to TRUE, such as
B = FILTER A BY 1 == 1;
6. Minimize the Number of Operations Performed
Example :
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FOREACH A GENERATE a+b AS e, c;
C = FOREACH B GENERATE e-c;
STORE C INTO '/home/hadoop/output/';
The operations above can be simplified to
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FOREACH A GENERATE a+b-c;
STORE B INTO '/home/hadoop/output/';
Calculating a+b, storing the result in an alias and then
subtracting c from it is the same as computing a+b-c directly, and combining
the two FOREACH statements saves time and memory.
7. Limit the Number of Output Lines
Store or dump only the number of
lines you are interested in by setting a limit.
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY a > 50;
C = LIMIT B 10;
DUMP C;
Here, instead of listing all the
values of B, just 10 values are viewed by limiting the count.
8. Push Up Filters:
Instead of processing the entire
data set, filter out only the input that is needed and do the processing on top of it.
Example:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
B = FILTER A BY b > 0;
C = FOREACH (GROUP B ALL) GENERATE SUM(B.b);
In the example above, instead of
computing the sum over all the values of field b, the values that are
greater than zero are filtered first and then summed, which prevents processing of unwanted data.
9. Do Explicit Casting:
Instead of letting Pig infer the types, perform the casting
explicitly when the data types are known, which saves time.
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int);
As mentioned above, load the data
with explicit types instead of
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a,b,c,d);
10. Keep Only the Necessary
Fields While Processing:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int,c:int,d:int,e:int);
B = FILTER A BY a > 0;
C = FILTER B BY b > 50;
D = FOREACH (GROUP B ALL) GENERATE COUNT(B.c);
E = FOREACH (GROUP C ALL) GENERATE COUNT(C.d);
F = UNION D, E;
Here in the example above, field e is
used nowhere in the processing, so it can be dropped right after loading the data into
alias A, which saves time and memory and improves performance.
The script above can be modified as
A = load '/home/hadoop/input/' using PigStorage('\t') as (a:int,b:int,c:int,d:int,e:int);
A1 = FOREACH A GENERATE a, b, c, d;
B = FILTER A1 BY a > 0;
C = FILTER A1 BY b > 50;
D = FOREACH (GROUP B ALL) GENERATE COUNT(B.c);
E = FOREACH (GROUP C ALL) GENERATE COUNT(C.d);
F = UNION D, E;
11. UDF Considerations:
UDFs are used to add customized functionality to a Pig script. UDFs that are
not coded properly can degrade the performance of scripts. The following points
should be kept in mind when creating a UDF:
A. Place the UDF classes in a proper
package.
B. Register the jars before calling the
UDFs.
C. Don't register unwanted jars in the Pig script.
D. Call the functions with their proper
package and class names.
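For illustration (the jar path, package and class names below are hypothetical), registering and calling a UDF looks like:
REGISTER '/home/hadoop/udfs/myudfs.jar';
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:chararray);
-- call the UDF with its full package and class name
B = FOREACH A GENERATE com.example.pig.UpperCase(b);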
12. Drop Null Values Before Calculations:
Remove records whose relevant fields are null. Null values have no
significance in JOIN, SUM, difference and other algebraic functions and are
usually ignored, so dropping them early avoids carrying them through the pipeline.
For example, the Pig script can be written as
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
B = FILTER A BY a is not null;
C = FILTER A BY b is not null;
D = JOIN B by a,C by b;
13. Use the PARALLEL Clause
The PARALLEL
clause lets us set the number of reduce tasks that run for a job.
By default PARALLEL is set to 1. If PARALLEL is not set, we
get only one reduce task, which affects the performance of
the Pig script.
For example, you can set the default parallelism
in a Pig script as
SET default_parallel 10;
This sets the number of reducers
that can run in parallel to 10.
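PARALLEL can also be attached to an individual reduce-side operator, overriding the default for just that statement (the path and values here are illustrative):
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
-- use 10 reducers for this GROUP only
B = GROUP A BY a PARALLEL 10;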
14. Load Only the Needed Input
If you want to do processing only on the first 50 rows of a file,
then limit the relation to 50 rows
instead of processing the entire file.
For example:
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
B = LIMIT A 50;
15. DISTINCT VS GROUP BY
If you want to find distinct values of a
field then there are two approaches to obtain it and they are GROUP BY and
DISTINCT. It is always better to use DISTINCT instead of GROUP BY to boost the
performance of pig script.
Example
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
B = FOREACH A GENERATE a;
C = GROUP B BY a;
This can be improved by changing it to
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
B = FOREACH A GENERATE a;
C = DISTINCT B;
16. Filter Early and Often
If possible, filter
the values before doing a GROUP BY. The combiner improves the performance of a
script, but it can be used only when there are no operators between the GROUP
BY and the FOREACH.
Example:
A = load '/home/hadoop/input/' using
PigStorage('\t') as (a:int,b:int);
B = GROUP A BY a;
C = FOREACH B GENERATE group, SUM(A.b) AS s;
D = FILTER C BY s > 0;
Can be optimized as
A = load '/home/hadoop/input/'
using PigStorage('\t') as (a:int,b:int);
A1 = FILTER A BY b > 0;
B = GROUP A1 BY a;
C = FOREACH B GENERATE group, SUM(A1.b);
17. Keep the Largest Input Last
When doing a JOIN or COGROUP with
inputs of different sizes, always keep the largest data set at the end.
Pig streams the last relation in the statement instead of materializing it,
so doing this improves performance and makes the script more efficient.
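As a quick sketch (the paths are illustrative), with a small lookup table and a large fact table the large one goes last:
small = load '/home/hadoop/small/'
using PigStorage('\t') as (id:int,name:chararray);
large = load '/home/hadoop/large/'
using PigStorage('\t') as (id:int,value:int);
-- the last relation in the JOIN is streamed rather than cached, so the largest input goes last
J = JOIN small BY id, large BY id;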