Tuesday, August 16, 2011

Copy To Hadoop At Scale

Copying large volumes of data into HDFS has been a high-latency job for us. We recently moved from our legacy Perl-based aggregation to HDFS-based aggregation. Our existing infrastructure dumps data onto a network file system, where Perl aggregation jobs crunch it. As we moved to Hadoop, we faced difficulties copying that volume of data into HDFS. As mentioned in my previous post, we have been evaluating options to bring data into Hadoop, and I tried a few of them.
I have written a small piece of code that takes a list of files to copy (in File URI format) and copies them in parallel using the FileSystem API of CDH3 across multiple threads. The other option is to use MapReduce and run a job that copies the files into HDFS; that harnesses the raw processing power of the cluster, but I/O time remains the concern.
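The code itself isn't published yet, so here is a minimal sketch of the multithreaded approach described above, assuming a class name, argument order, and thread-pool setup of my own choosing. It reads one File URI per line, then uses a fixed-size thread pool and FileUtil.copy from the FileSystem API to push each file into a target HDFS directory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/**
 * Copies a list of files (given as File URIs, one per line) into HDFS
 * using a fixed-size thread pool and the Hadoop FileSystem API.
 */
public class ParallelHdfsCopy {

    public static void main(String[] args) throws Exception {
        // args[0] = text file listing source URIs, e.g. file:///mnt/nfs/logs/part-0001
        // args[1] = target HDFS directory,         e.g. hdfs://namenode:8020/data/incoming
        // args[2] = number of copy threads
        final Configuration conf = new Configuration();
        final Path targetDir = new Path(args[1]);
        int threads = Integer.parseInt(args[2]);

        // Read the list of source file URIs.
        List<String> sources = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().length() > 0) {
                sources.add(line.trim());
            }
        }
        reader.close();

        // Submit one copy task per file to a fixed-size thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String uri : sources) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Path src = new Path(uri);
                        // Resolve source and destination FileSystems from the URIs
                        // (Hadoop caches these instances per scheme/authority).
                        FileSystem srcFs = src.getFileSystem(conf);
                        FileSystem dstFs = targetDir.getFileSystem(conf);
                        // Copy (not move) the file into the target directory.
                        FileUtil.copy(srcFs, src, dstFs,
                                new Path(targetDir, src.getName()),
                                false /* deleteSource */, conf);
                    } catch (Exception e) {
                        System.err.println("Failed to copy " + uri + ": " + e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}

For the MapReduce route, Hadoop already ships with DistCp (hadoop distcp), which uses map tasks to do the copying; note that when the source is a network file system, the mount has to be visible from every node running a map task.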
I will share the code at https://github.com/ shortly.
There are also purpose-built tools like Flume and Scribe for bringing big data into Hadoop.
I will come back shortly with small code snippets for working with Flume and Scribe.
