I have a tar archive (about 40 GB) containing many subfolders within which my data resides. The structure is: Folders -> Sub Folders -> json.bz2 files. TAR file details:
Total size: ~40 GB
Number of inner .bz2 files (arranged in folders): 50,000
Size of one .bz2 file: ~700 KB
Size of one extracted JSON file: ~6 MB
I have to load the JSON files into the HDFS cluster. I tried manually extracting the archive into my local directory, but I am running out of space. I am now planning to load the archive directly into HDFS and then uncompress it there, but I don't know whether that is a good way to solve the problem. As I am new to Hadoop, any pointers would be helpful.
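For reference, this is a minimal sketch of the streaming approach I was considering as an alternative, so that nothing has to be fully extracted to local disk first. The archive path, the HDFS target directory, and the assumption that the hdfs CLI is on my PATH are all placeholders, not my actual setup.

```python
import bz2
import os
import subprocess
import tarfile

ARCHIVE = "data.tar"           # hypothetical path to the 40 GB archive
HDFS_TARGET = "/user/me/json"  # hypothetical HDFS destination directory

# Open the tar in streaming mode ('r|') so members are read sequentially
# and the archive is never extracted to local disk as a whole.
with tarfile.open(ARCHIVE, mode="r|") as tar:
    for member in tar:
        if not member.isfile() or not member.name.endswith(".json.bz2"):
            continue
        compressed = tar.extractfile(member).read()  # ~700 KB in memory
        data = bz2.decompress(compressed)            # ~6 MB of JSON

        # Mirror the folder structure under HDFS_TARGET, dropping the .bz2 suffix.
        dest = os.path.join(HDFS_TARGET, member.name[: -len(".bz2")])
        subprocess.run(
            ["hdfs", "dfs", "-mkdir", "-p", os.path.dirname(dest)], check=True
        )
        # 'hdfs dfs -put -' reads the file contents from stdin, so nothing
        # is written locally.
        subprocess.run(["hdfs", "dfs", "-put", "-", dest], input=data, check=True)
```

One thing that worries me about this sketch is that it spawns a couple of hdfs processes per file, which will be slow for 50,000 files, so I am not sure it is the right approach either.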