Consider this scenario:
I have 4 files, 6 MB each. The HDFS block size is 64 MB. One block will hold all these files, with some space left over; if new files are added, they will be accommodated there.
Now, when the input splits are calculated for a MapReduce job by the InputFormat (split size is usually the HDFS block size, so that each split can be loaded into memory for processing, thereby reducing seek time), how many input splits are made here? Is it one, because all 4 files are contained within a single block? Or is it one input split per file? How is this determined? And what if I want all the files to be processed as a single input split?
You'll actually have 4 blocks. It doesn't matter whether all the files could fit into a single block or not.
EDIT: Blocks belong to a file, not the other way around. HDFS is designed to store large files that are almost certainly going to be larger than your block size. Storing multiple files per block would add unnecessary complexity to the namenode...
Instead of a block being just `blk0001`, it's now `blk0001 {file-start -> file-end}`.
Still 1 split per file. This is determined by the InputFormat: the default `FileInputFormat` computes splits file by file and never combines multiple files into one split.
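To make the per-file arithmetic concrete, here is a minimal, self-contained sketch of that split calculation. It is an approximation of what `FileInputFormat.getSplits` does, not the real implementation (it ignores block locations and the implementation's ~10% slop factor):

```java
// Sketch of FileInputFormat-style split counting: each file is split
// independently, so splits never cross file boundaries.
public class SplitSketch {
    public static void main(String[] args) {
        long blockSize = 64L << 20;                 // 64 MB HDFS block size
        long[] fileLengths = {6L << 20, 6L << 20,   // four 6 MB files
                              6L << 20, 6L << 20};

        long splitSize = blockSize;                 // default: split size == block size
        int splits = 0;
        for (long len : fileLengths) {
            // Each file contributes ceil(len / splitSize) splits --
            // at least one per non-empty file, regardless of how small it is.
            splits += (int) ((len + splitSize - 1) / splitSize);
        }
        System.out.println("splits = " + splits);   // prints: splits = 4
    }
}
```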
Use a different input format, such as `CombineFileInputFormat` (the older mapred API has `MultiFileInputFormat`), which packs many small files into each split.
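For example, a driver could be set up as below. This is a minimal sketch assuming the Hadoop 2.x MapReduce API and hypothetical `/input` and `/output` paths; `CombineTextInputFormat` is the concrete text-file subclass of `CombineFileInputFormat`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFiles.class);

        // Pack many small files into each split instead of 1 split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 64 MB -- our four 6 MB files fit in one.
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the max split size set to 64 MB, the four 6 MB files should end up in a single combined split, and hence a single map task, subject to how the combiner groups blocks by node and rack.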