Monday, August 22, 2011

MapReduce Job to Calculate Total Lines in Your Dataset

"Get the total lines in our logs" is a common requirement that trickles down from an application manager to a poor Hadoop engineer. "What is the ETA?" is the next question in the firing line. The common belief now is that Hadoop is the genie's magic lamp: just stand in front of it and say "I need this......" and it will answer "Aka... Here is your total line count". I would rather say Hadoop is like Pandora's box: you open it and discover many things. So let me come to the point. The idea is simple: the mapper emits the same key with a value of 1 for every input line, and the reducer (which doubles as the combiner) adds those values up. Here is my solution for the poor Hadoop guy:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LineCount {

    // Mapper: ignores the line's content and emits (constant key, 1) for every input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text("AKA Total Lines For You....");

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            output.collect(word, one);
        }
    }

    // Reducer: sums the values for the single key. Also registered as the combiner below.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LineCount.class);
        // Default filesystem for the job; change this to match your own cluster.
        conf.set("fs.default.name","hdfs://localhost:8020/home/hadoop/");
        conf.setJobName("LineCount");

        // Types of the (key, value) pairs the job writes out.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input is read from the local filesystem; output goes to the default filesystem.
        FileInputFormat.setInputPaths(conf, new Path("file:///Users/jagarandas/Work-Assignment/Analytics/analytics-poc/sample-data/"));
        FileOutputFormat.setOutputPath(conf, new Path("/home/hadoop/sample-data1/"));

        JobClient.runJob(conf);
    }
}
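
If you want to convince yourself why this job ends up with a single total, it can help to trace the same data flow without a cluster. What follows is only a rough sketch in plain Java, not Hadoop code: the class LineCountSketch and its methods mapSplit, sum and countLines are names I made up for illustration, and the two "splits" are invented sample lines. Each split is mapped to a list of 1s, summed once per split (the combiner step), and the partial sums are summed again (the reducer step).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the job's data flow; no Hadoop classes are used.
// All names here (LineCountSketch, mapSplit, sum, countLines) are hypothetical.
class LineCountSketch {

    // Same constant key the mapper in LineCount emits for every line.
    static final String KEY = "AKA Total Lines For You....";

    // Map phase: every line in a split becomes the value 1 under the shared key,
    // so the content of the line is ignored.
    static List<Integer> mapSplit(List<String> lines) {
        List<Integer> ones = new ArrayList<>();
        for (String ignored : lines) {
            ones.add(1);
        }
        return ones;
    }

    // Combiner and reducer share this logic: add up the values seen for the key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Runs map + combine on each split, then one final reduce over the partial sums.
    static int countLines(List<List<String>> splits) {
        List<Integer> partials = new ArrayList<>();
        for (List<String> split : splits) {
            partials.add(sum(mapSplit(split)));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Two made-up splits holding three lines in total.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("GET /index.html", "GET /about.html"),
            Arrays.asList("POST /login"));
        // Key and value separated by a tab, as TextOutputFormat does by default.
        System.out.println(KEY + "\t" + countLines(splits));
    }
}
```

Because addition gives the same answer no matter how the values are grouped, summing per split first and then summing the partial results matches summing everything at once. That is why the same Reduce class can safely serve as the combiner in the real job.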
