How to extract the contents of bz2 files - Hadoop

I have a tar archive (about 40 GB) containing many subfolders in which my data resides. The structure is: Folders -> Sub Folders -> json.bz2 files. Details of the TAR file:

Total size: ~ 40GB
Number of inner .bz2 files (arranged in folders): 50,000
Size of one .bz2 file: ~700 KB
Size of one extracted JSON file: ~6 MB.

I have to load the JSON files into the HDFS cluster. I tried extracting the archive manually into a local directory, but I run out of disk space. I am now planning to load the archive directly into HDFS and uncompress it there, but I don't know whether that is a good way to solve the problem. As I am new to Hadoop, any pointers would be helpful.
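For scale, going by the figures above, 50,000 files at roughly 6 MB each come to about 300 GB of extracted JSON, which would explain the local disk filling up. One possible way around that is to never write the extracted files locally: read the tar member by member, decompress each .bz2 in memory as a stream, and pipe it straight into HDFS. The following is only a minimal sketch under several assumptions: the `hdfs` command-line client is on the PATH, the function names and the `hdfs_dir` argument are hypothetical, and starting one `hdfs dfs -put -` process per file (which reads its input from stdin) will be slow for 50,000 files.

```python
import bz2
import posixpath
import shutil
import subprocess
import tarfile


def iter_decompressed_members(tar_path):
    """Yield (relative_path, stream) for every .bz2 file in the tar.

    Only one member is open at a time, so nothing is extracted to local disk.
    The ".bz2" suffix is dropped from the yielded path.
    """
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".bz2")):
                continue  # skip directories and anything that is not a .bz2 file
            compressed = archive.extractfile(member)
            with bz2.open(compressed, mode="rb") as decompressed:
                yield member.name[: -len(".bz2")], decompressed


def copy_archive_to_hdfs(tar_path, hdfs_dir):
    """Stream each decompressed file into HDFS (hypothetical helper).

    Assumes the `hdfs` CLI is installed and configured for the cluster.
    """
    for relative_path, stream in iter_decompressed_members(tar_path):
        target = posixpath.join(hdfs_dir, relative_path)
        # Create the parent folder in HDFS so the original layout is preserved.
        subprocess.run(
            ["hdfs", "dfs", "-mkdir", "-p", posixpath.dirname(target)],
            check=True,
        )
        # "-put -" makes the HDFS shell read the file content from stdin.
        put = subprocess.Popen(
            ["hdfs", "dfs", "-put", "-", target], stdin=subprocess.PIPE
        )
        shutil.copyfileobj(stream, put.stdin)
        put.stdin.close()
        if put.wait() != 0:
            raise RuntimeError(f"hdfs put failed for {target}")
```

It would be called as something like `copy_archive_to_hdfs("data.tar", "/user/me/json")`, where both paths are placeholders. Keeping the reading part separate from the upload part means the reader can be checked on a small test archive without a cluster.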

There are 0 answers