Sunday, August 21, 2016

Executing scripts

Pig Lab 11: Executing Scripts and UDFs.

Executing scripts (Pig) :-
____________________________

   Three commands are used to execute Pig scripts:

  i) pig
  ii) exec. 
  iii) run.

i) pig:-

   Executes a script from the command line (operating system shell).

  $ pig  script1.pig
--> The script is executed,
   but its relation aliases are not available in the grunt shell, so we cannot reuse them.

ii) exec:

--> Executes a script from the grunt shell.
The aliases are still not available in grunt,
   so "no reuse".

grunt> exec  script1.pig

__________________________________

iii) run:

---> Executes a script from the grunt shell.
   The aliases are available in grunt, so we can reuse them.

grunt> run script1.pig
______________________________

run:
  adv --> aliases will be available.
  disadv --> overrides previous aliases that have the same name.

exec:
   adv --> aliases will not be available,
    so no overriding.
  disadv --> no reusability.

pig :
  adv --->
      -- used for production operations.
     --- can be called from other environments, like a shell script.

disadv --> aliases will not be reflected in grunt.
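
Because pig is an ordinary command, any environment that can start a process can call it. Below is a minimal Java sketch of that idea, assuming pig is on the PATH. The class name PigLauncher and the helpers buildCommand and runScript are hypothetical names used only for illustration. The -x local option selects Pig's local execution mode.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: launches a Pig script as an external process,
// the same way a shell script would call `pig script1.pig`.
class PigLauncher {

    // Builds the argument list for the pig command line.
    static List<String> buildCommand(String script, boolean localMode) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pig");              // assumption: pig is on the PATH
        if (localMode) {
            cmd.add("-x");           // execution-mode option
            cmd.add("local");        // run against the local file system
        }
        cmd.add(script);
        return cmd;
    }

    // Runs the script and returns the process exit code (0 means success).
    static int runScript(String script, boolean localMode)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(script, localMode));
        pb.inheritIO();              // show Pig's console output
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PigLauncher <script.pig>");
            return;
        }
        System.exit(runScript(args[0], false));
    }
}
```

As with the plain pig command, the aliases defined by the script are not visible to any grunt session afterwards.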

_________________________________

Pig UDFs:
_____________

   User-defined functions.

   adv:
   i) Custom functionality.
   ii) Reusability.

UDF life cycle:

step 1) Develop the UDF class.
step 2) Export it into a jar file.
step 3) Register the jar file in Pig.
step 4) Create a temporary function (alias) for the UDF class.
step 5) Call the function.
__________________________

[training@localhost ~]$ cat > samp1
101,ravi
102,mani
103,Deva
104,Devi
105,AmAr
[training@localhost ~]$ hadoop fs -copyFromLocal samp1  piglab
[training@localhost ~]$

grunt> s = load 'piglab/samp1'
>>    using PigStorage(',')
>>   as (id:int, name:chararray);

eclipse navigations:
i) create java project.

file --> new --> java project.
  ex: PigDemo

ii) create package.

  PigDemo ---> new ---> package.

    ex:    pig.test

  iii) configure pig jar.

src --> build path --> configure build path --> libraries ---> add external jars.
 
  /usr/lib/pig/pig-core.jar

  iv) create java class

pig.test ---> new --> class

   FirstUpper

  v)  export into jar.

PigDemo ---> export --> Java --> JAR file -->
    /home/training/Desktop/pigudfs.jar

_____________________
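package pig.test;
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FirstUpper  extends EvalFunc<String>
{
  @Override
  public String exec(Tuple v) throws IOException
  {    //   raVI  --> Ravi

     // Return null for missing input so Pig emits a null field instead of failing.
     if (v == null || v.size() == 0 || v.get(0) == null) return null;
     String name = (String)v.get(0);
     if (name.isEmpty()) return name;                 // nothing to capitalize
     String fc = name.substring(0,1).toUpperCase();   // first character, upper case
     String rc = name.substring(1).toLowerCase();     // remaining characters, lower case
     return fc+rc;
  }

}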

grunt> register  Desktop/pigudfs.jar;

grunt> define cconvert pig.test.FirstUpper();

grunt> r = foreach s generate
>>       id, cconvert(name) as name;

grunt> dump r;
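
The capitalization logic can also be checked outside Pig before exporting the jar. Below is a minimal stand-alone sketch that applies the same transformation to the names in samp1. The class FirstUpperCheck and the method firstUpper are hypothetical names, and passing null or empty input through unchanged is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone check of the capitalization logic; no Pig classes are needed.
class FirstUpperCheck {

    // Same transformation as the UDF: raVI --> Ravi.
    // Assumption: null and empty inputs are returned unchanged.
    static String firstUpper(String name) {
        if (name == null || name.isEmpty()) {
            return name;
        }
        return name.substring(0, 1).toUpperCase()
                + name.substring(1).toLowerCase();
    }

    public static void main(String[] args) {
        // Names taken from the samp1 file used in this lab.
        List<String> names = Arrays.asList("ravi", "mani", "Deva", "Devi", "AmAr");
        for (String n : names) {
            System.out.println(n + " --> " + firstUpper(n));
        }
    }
}
```

Each name should come out with only its first letter in upper case, for example AmAr --> Amar.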

______________________________

[training@localhost ~]$ cat > f1
100     200     120
300     450     780
120     56      90
1000    3456    789
[training@localhost ~]$ hadoop fs -copyFromLocal f1 piglab
[training@localhost ~]$

Task:
  Write a UDF to find the max value of each row.

package pig.test;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

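public class RowMax extends EvalFunc<Integer>
{
  @Override
  public Integer exec(Tuple v) throws IOException
  {
    // A null field cannot be compared, so return null for incomplete rows.
    if (v == null || v.size() < 3
        || v.get(0) == null || v.get(1) == null || v.get(2) == null) return null;
    int a = (Integer)v.get(0);
    int b = (Integer)v.get(1);
    int c = (Integer)v.get(2);
    int big = a;   // seed with the first value, not 0, so all-negative rows work
    if (b>big) big=b;
    if (c>big) big=c;
    return Integer.valueOf(big);
  }

}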

export into jar.

    /home/training/Desktop/pigudfs.jar

grunt> s1 = load 'piglab/f1'
>>     as (a:int, b:int, c:int);
grunt> register Desktop/pigudfs.jar;
grunt> define rowmax pig.test.RowMax();
grunt> r1 = foreach s1 generate  *,
>>           rowmax(*) as rmax;
grunt> dump r1;
(100,200,120,200)
(300,450,780,780)
(120,56,90,120)
(1000,3456,789,3456)
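
A three-column row maximum is safest when the running maximum starts from the first value rather than from 0. Otherwise a row that contains only negative numbers would report 0. Below is a minimal stand-alone sketch of that comparison, using the rows of f1 plus one all-negative row added for illustration. RowMaxCheck and rowMax are hypothetical names.

```java
// Stand-alone check of the row-maximum logic for three columns.
class RowMaxCheck {

    // Seeds the running maximum with the first value instead of 0,
    // so rows that contain only negative numbers are handled correctly.
    static int rowMax(int a, int b, int c) {
        int big = a;
        if (b > big) big = b;
        if (c > big) big = c;
        return big;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {100, 200, 120},
            {300, 450, 780},
            {120, 56, 90},
            {1000, 3456, 789},
            {-10, -30, -56}      // illustrative all-negative row, not in f1
        };
        for (int[] r : rows) {
            System.out.println(r[0] + "," + r[1] + "," + r[2]
                    + " --> " + rowMax(r[0], r[1], r[2]));
        }
    }
}
```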

[training@localhost ~]$ cat f2
-10,-30,-56,-23,-21,-5
1,2,3,45,67,9
[training@localhost ~]$ hadoop fs -copyFromLocal f2 piglab
[training@localhost ~]$

package pig.test;

import java.io.IOException;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DynRowMax  extends EvalFunc<Integer>
{
  @Override
  public Integer exec(Tuple v) throws IOException
  {
    if (v == null) return null;
    List<Object> olist = v.getAll();
    Integer max = null;                 // no value seen yet
    for (Object o : olist) {
      if (o == null) continue;          // skip null fields
      int val = (Integer)o;
      if (max == null || val > max) max = val;
    }
    return max;   // null when the row has no non-null values
  }

}

export into jar   /home/training/Desktop/pigudfs.jar

grunt> register Desktop/pigudfs.jar;
grunt> define dynmax pig.test.DynRowMax();
grunt> ss = load 'piglab/f2'   
>>    using PigStorage(',')
>>   as (a:int, b:int, c:int, d:int,
>>     e:int, f:int);
grunt> define  rmax pig.test.RowMax();
grunt> rr = foreach ss generate *,
>>      dynmax(*) as max;
grunt> dump rr;
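
The variable-width maximum can be checked outside Pig in the same way. Below is a minimal stand-alone sketch that scans a list of any length, starting with "no maximum yet" so that negative rows such as the first row of f2 work. Skipping null values and returning null for a row with no values are assumptions of this sketch. DynRowMaxCheck and rowMax are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone check of the row-maximum logic for any number of columns.
class DynRowMaxCheck {

    // Returns the largest non-null value, or null when there is none.
    static Integer rowMax(List<Integer> values) {
        Integer max = null;                  // no value seen yet
        for (Integer v : values) {
            if (v == null) continue;         // assumption: nulls are skipped
            if (max == null || v > max) max = v;
        }
        return max;
    }

    public static void main(String[] args) {
        // Rows taken from the f2 file used in this lab.
        List<Integer> row1 = Arrays.asList(-10, -30, -56, -23, -21, -5);
        List<Integer> row2 = Arrays.asList(1, 2, 3, 45, 67, 9);
        System.out.println(row1 + " --> " + rowMax(row1));
        System.out.println(row2 + " --> " + rowMax(row2));
    }
}
```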
