We are really struggling to copy data into HDFS. We use Hadoop to aggregate data and build reports for executives, and we operate under a 15-minute SLA.
Currently a shell script periodically copies from local storage; the data volume is around 120 GB/day and this works. The files are small, so after copying we run a Pig script to compact them to an optimal size.
But with some major product releases coming shortly, the projected volume increases to 12 TB/day (sales projection). Our current solution will not scale to meet the SLA.
So we have thought of a few options to optimize the copy into Hadoop:
1. Currently we copy from a mounted filer (a separate application writes the data to the filer). Instead, we could stream directly to HDFS using the FileSystem class, opening multiple connections and writing through them continuously. At peak we may have to write 1 GB/sec. Would this solution scale?
2. Keep our current approach of copying from the filer, but speed it up:
a. Do the copy in a multithreaded way, i.e. run copyFromLocal from multiple threads.
b. A blog post described loading data into a single-node cluster and then running distcp to the main big cluster; distcp copies in parallel (it runs as a MapReduce job).
c. Write a MapReduce job that copies from the local filesystem to HDFS; this would need a custom FileInputFormat class.
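For option 1, a minimal sketch of what the direct-streaming writer could look like with the Hadoop FileSystem API. The namenode URI and target path here are placeholders, not our real cluster config, and this assumes the Hadoop client jars are on the classpath:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamWriter {
    // Streams bytes from `in` straight into an HDFS file, skipping the filer.
    public static void writeStream(InputStream in, String target) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for the real namenode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // HDFS allows only a single writer per file, so to get parallelism each
        // connection/thread must write its own target file.
        try (FSDataOutputStream out = fs.create(new Path(target))) {
            IOUtils.copyBytes(in, out, 4096, false); // 4 KB buffer; leave `in` open
        }
    }
}
```

The open question is whether N such writers can sustain 1 GB/sec in aggregate; that depends on the pipeline replication factor and network, not on this client code.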
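For option 2a, a runnable sketch of the multithreaded copy pattern. It uses plain `Files.copy` between temp directories as a stand-in so it can run anywhere; in production each task would instead invoke `hadoop fs -copyFromLocal` (or the FileSystem API) against the real cluster:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCopyDemo {

    /** Copies every file under src to dst using `threads` workers; returns the file count. */
    public static long copyAll(Path src, Path dst, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                // In production this task would run "hadoop fs -copyFromLocal <f> /incoming/"
                // instead of a local Files.copy.
                pending.add(pool.submit(() -> {
                    try {
                        Files.copy(f, dst.resolve(f.getFileName()));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
        }
        for (Future<?> f : pending) f.get(); // propagate any copy failure
        pool.shutdown();
        try (var listed = Files.list(dst)) {
            return listed.count();
        }
    }

    public static void main(String[] args) throws Exception {
        Path src = Files.createTempDirectory("filer");      // stand-in for the mounted filer
        Path dst = Files.createTempDirectory("hdfs-stage"); // stand-in for HDFS
        for (int i = 0; i < 8; i++) {
            Files.write(src.resolve("part-" + i + ".log"), ("record " + i).getBytes());
        }
        System.out.println("copied " + copyAll(src, dst, 4) + " files");
    }
}
```

Note the pattern only helps if the filer and network can feed that many concurrent readers; for option 2b, distcp gives the same parallelism but distributed across cluster nodes rather than threads on one box.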