Serve static files from Hadoop

My job is to design a distributed system for serving static image/video files. The total data size is on the order of tens of terabytes. The files are mostly accessed over HTTP, so there is no processing on the data, or only simple processing such as resizing; even that is not important here, because it can be done directly in the application.

To be a little more clear, it's a system that:

  1. Must be distributed (horizontal scale), because the total size of data is very big.
  2. Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
  3. Generally does not need to process the data (so MapReduce is not required).
  4. Makes it easy to set up HTTP access to the data.
  5. (Should have) good throughput.

I am considering:

  1. A native network file system: this seems infeasible because the data cannot fit on one machine.

  2. The Hadoop filesystem (HDFS). I have worked with Hadoop MapReduce before, but I have no experience using Hadoop as a static file repository serving HTTP requests, so I don't know whether it is possible or recommended.

  3. MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.

Any suggestions, please?

There are 4 answers

2 · chrislusf (BEST ANSWER)

I am the author of Weed-FS. For your requirements, Weed-FS is ideal. Hadoop cannot handle many small files: in addition to the reasons you mention, each file needs an entry in the master, so if the number of files is large, the HDFS master node (the namenode) cannot scale.

Weed-FS gets faster when compiled with the latest Go releases.

Many improvements have been made to Weed-FS recently. You can now test and compare very easily with the built-in upload tool, which uploads all files under a directory recursively:

weed upload -dir=/some/directory

You can then run "du -k /some/directory" to see the original disk usage and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.

I suppose you will also need replication with data-center and rack awareness, etc. These are in now!
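Roughly, the HTTP round trip looks like this from Java (just a sketch: the ports are the Weed-FS defaults and the file id is a made-up example). You ask the master for a file id and a volume server, upload the file to that volume server with a multipart POST, and then any client can read it back with a plain HTTP GET:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WeedFsSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Step 1: ask the master (default port 9333) to assign a file id.
            // The JSON response looks like {"fid":"3,01637037d6","url":"127.0.0.1:8080",...}
            HttpRequest assign = HttpRequest.newBuilder(
                    URI.create("http://localhost:9333/dir/assign")).GET().build();
            System.out.println(client.send(assign, HttpResponse.BodyHandlers.ofString()).body());

            // Step 2: after uploading the file to http://<url>/<fid> (a multipart POST),
            // it can be served to browsers straight from the volume server with a GET.
            HttpRequest read = HttpRequest.newBuilder(
                    URI.create("http://127.0.0.1:8080/3,01637037d6")).GET().build();
            byte[] bytes = client.send(read, HttpResponse.BodyHandlers.ofByteArray()).body();
            System.out.println("read " + bytes.length + " bytes");
        }
    }

Since the volume servers speak plain HTTP, you can also put a web server or CDN directly in front of them.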

0 · Tejas Patil

Hadoop has a REST API for accessing files; see this entry in the documentation (a rough sketch of calling it appears after the list below). I feel that Hadoop is not meant for storing a large number of small files:

  • HDFS is not geared toward efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, which is an inefficient data access pattern.
  • Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies roughly 150 bytes. The default block size is 64 MB, so even a 10 KB file gets its own block and its own set of namenode objects; the waste is mainly in namenode memory (the file itself still occupies only 10 KB on disk). For example, 100 million small files, at two or three namenode objects each, already need tens of gigabytes of namenode heap.
  • If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file stored as 16 blocks of 64 MB with roughly 10,000 files of 100 KB each: the 10,000 files use one map task each, and the job can be tens or hundreds of times slower than the equivalent job with a single input file.
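For completeness, the REST access mentioned above looks roughly like this (WebHDFS; the host, port and path below are placeholders, and 50070 is only the long-time default namenode HTTP port). The namenode answers ?op=OPEN with a redirect to a datanode, which then streams the file bytes:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsReadSketch {
        public static void main(String[] args) throws Exception {
            // Follow redirects, because the namenode redirects OPEN requests to a datanode.
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.ALWAYS)
                    .build();

            // GET /webhdfs/v1/<path>?op=OPEN streams the file contents over HTTP.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://namenode:50070/webhdfs/v1/images/cat.jpg?op=OPEN"))
                    .GET().build();
            byte[] bytes = client.send(request, HttpResponse.BodyHandlers.ofByteArray()).body();
            System.out.println("read " + bytes.length + " bytes");
        }
    }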

In "Hadoop Summit 2011", there was this talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc) over HDFS but they dont use the same infra for images and videos. They have their own system named Haystack for images. Its not open source but they shared the abstract design level details about it.

This brings me to weed-fs: an open source project inspired by Haystack's design. It's tailor-made for storing small files. I have not used it yet, but it seems worth a shot.

0 · Arnon Rotem-Gal-Oz

Hadoop is optimized for large files; e.g., its default block size is 64 MB. A lot of small files are both wasteful and hard to manage on Hadoop.

You can take a look at other distributed file systems, e.g. GlusterFS.

0 · Gary Crawford

If you are able to batch the files and have no requirement to update a batch after it has been added to HDFS, then you could pack multiple small files into a single, larger binary SequenceFile. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).

This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
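A minimal sketch of the packing step looks like this (the paths are placeholders; each small file becomes one key/value record, with the file name as a Text key and the raw bytes as a BytesWritable value):

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackImagesSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One large SequenceFile on HDFS holding many small images.
            Path out = new Path("/data/images.seq");       // placeholder output path
            File localDir = new File("/local/ct-slices");  // placeholder input directory

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                for (File f : localDir.listFiles()) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }

On the read side, a SequenceFile.Reader (or a MapReduce job using SequenceFileInputFormat) iterates over the records in long streaming reads rather than opening thousands of tiny files.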

Hope this helps!

G