Saturday, August 13, 2011

Struggling to Copy to HDFS

Actually, we are really struggling to copy data to HDFS. We use Hadoop to aggregate data and build reports for executives.
We have an SLA of 15 minutes.
Currently we do the copy from local periodically using a shell script; the data volume is around 120 GB/day and it works.
The file sizes are small, so we run a Pig script after copying to bring the files up to an optimum size.

But with some major product releases coming shortly, the projected data volume would increase to 12 TB/day (sales projection).
Our current solution would not scale to maintain the SLA.
So I thought of a few options to optimize the copy to Hadoop:

1. Currently we copy from a mounted filer (a separate application does the copy to the filer). I thought of streaming directly to HDFS using the FileSystem class, opening multiple connections and writing through them continuously. There could be a peak period in the day where we may have to write 1 GB/sec. Would this solution scale? (See the first sketch after this list.)
2. Going by our current approach, we would still copy from the filer, but to speed things up:
a. Do the copy in a multithreaded way, basically running copyFromLocal from multiple threads (second sketch below).
b. I saw a blog where people copied to a single-node cluster and then did a distcp to the original big cluster; they said distcp works in parallel.
c. Write a Map-Reduce job to copy from the local filesystem to HDFS; I would need a custom FileInputFormat class (third sketch below).
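Here is a minimal sketch of option 1, just the plain Hadoop FileSystem API; the namenode URI, destination path, and buffer size are placeholders, not our real setup. One writer handles one file, so to get anywhere near 1 GB/sec we would run many of these in parallel:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamWriter {
    // Streams bytes from any source straight into an HDFS file,
    // skipping the intermediate hop through the filer.
    public static void streamToHdfs(InputStream in, String dest) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FSDataOutputStream out = fs.create(new Path(dest));
        try {
            // 64 KB buffer; 'false' leaves the streams open so we close them ourselves
            IOUtils.copyBytes(in, out, 64 * 1024, false);
        } finally {
            out.close();
            in.close();
        }
    }
}

Since HDFS gets its aggregate write throughput from many concurrent streams rather than one fast one, the real scaling question is how many of these writers the network and the DataNodes can sustain at peak.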
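For option 2a, a sketch of the multithreaded copy using a thread pool around copyFromLocalFile; the filer mount point, HDFS destination, and pool size of 16 are guesses to be tuned against the filer's read bandwidth:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelCopyFromLocal {
    public static void main(String[] args) throws Exception {
        final FileSystem fs = FileSystem.get(new Configuration());
        // Pool size is a placeholder; tune it to the filer and network.
        ExecutorService pool = Executors.newFixedThreadPool(16);

        for (final File f : new File("/mnt/filer/incoming").listFiles()) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Per-file equivalent of "hadoop fs -copyFromLocal"
                        fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                                new Path("/data/incoming/" + f.getName()));
                    } catch (Exception e) {
                        e.printStackTrace(); // retries/logging left out of the sketch
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.MINUTES); // our SLA window
    }
}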
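And a sketch of option 2c. Instead of writing a custom FileInputFormat, NLineInputFormat might do: stage a list of filer paths in HDFS and let each mapper copy a batch of them. This assumes the filer is mounted at the same path on every worker node, and all the paths below are made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceCopy {

    // Each input value is one line of the staged file list: a path on the
    // filer mount. The mapper copies that file into HDFS.
    public static class CopyMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path src = new Path(value.toString());        // e.g. /mnt/filer/...
            Path dst = new Path("/data/incoming", src.getName());
            fs.copyFromLocalFile(src, dst);
            context.write(value, new Text("copied"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "copy-to-hdfs");
        job.setJarByClass(MapReduceCopy.class);
        job.setMapperClass(CopyMapper.class);
        job.setNumReduceTasks(0);                      // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10); // 10 files per mapper
        NLineInputFormat.addInputPath(job, new Path("/tmp/file-list.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/copy-report"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This also ties back to option 2b: distcp is itself just a map-only MapReduce job, so this is roughly the same idea pointed at the filer mount instead of another cluster.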
