LZO files issue on S3


I have 3 LZO compressed files and their corresponding index files in HDFS.

Permission  Owner  Group       Size       Replication  Block Size  Name
-rw-r--r--  alum   supergroup  0 B        3            128 MB      _SUCCESS
-rw-r--r--  alum   supergroup  192.29 MB  3            128 MB      part-00000.lzo
-rw-r--r--  alum   supergroup  89.56 KB   3            128 MB      part-00000.lzo.index
-rw-r--r--  alum   supergroup  243.09 MB  3            128 MB      part-00001.lzo
-rw-r--r--  alum   supergroup  106.67 KB  3            128 MB      part-00001.lzo.index
-rw-r--r--  alum   supergroup  163.99 MB  3            128 MB      part-00002.lzo
-rw-r--r--  alum   supergroup  70.54 KB   3            128 MB      part-00002.lzo.index

We copied these files to Amazon S3 and created a Hive external table over them for analytics.

We are facing two problems:

1) The LZO index files are also treated as data files, so meaningless data appears in the Hive table.
2) A "count(*)" query on the table runs with only 4 mappers, which suggests a problem with splitting.
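
To make problem 1 concrete: a reader that does not know about LZO indexes sees every object in the table location as data, so the ".lzo.index" side-files would have to be filtered out before reading. Below is a minimal Python sketch of that filtering, using the file names from the listing above. The helper name partition_listing is hypothetical and is not part of Hive or any LZO library; it only illustrates the idea.

```python
# Suffixes taken from the file names in the listing above.
LZO_SUFFIX = ".lzo"
INDEX_SUFFIX = ".lzo.index"


def partition_listing(names):
    """Split a directory listing into (data files, index files, other files).

    Hypothetical helper for illustration only.
    """
    data, index, other = [], [], []
    for name in names:
        # Check the longer suffix first so index files are not mistaken for data.
        if name.endswith(INDEX_SUFFIX):
            index.append(name)
        elif name.endswith(LZO_SUFFIX):
            data.append(name)
        else:
            other.append(name)
    return data, index, other


listing = [
    "_SUCCESS",
    "part-00000.lzo", "part-00000.lzo.index",
    "part-00001.lzo", "part-00001.lzo.index",
    "part-00002.lzo", "part-00002.lzo.index",
]

data, index, other = partition_listing(listing)
print("data files: ", data)
print("index files:", index)
print("other files:", other)
```

Only the three "part-*.lzo" files end up in the data list; if the index files are read as rows instead, they show up as the meaningless data described above.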

Could you please let me know what is going on with S3? The same setup works fine on our YARN cluster.

There is 1 answer

Durga Viswanath Gadiraju

S3 is treated differently from HDFS, so the split logic need not be applied the way it is in HDFS. Remember that S3 is cloud object storage, whereas HDFS stores files as blocks on the cluster's own disks. Your files are not stored in the form of blocks in S3, so this behavior is expected.
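
For comparison, the block-based split arithmetic that the question expects from HDFS can be sketched as follows. This is a rough estimate under stated assumptions, not the exact logic of any input format: it assumes one split per 128 MB of an indexed (splittable) file and exactly one split for a file that cannot be split. The function name estimated_splits is hypothetical; the sizes come from the listing in the question.

```python
import math

BLOCK_MB = 128.0  # block size shown in the HDFS listing


def estimated_splits(size_mb, split_mb=BLOCK_MB, splittable=True):
    """Rough split count for one file (hypothetical helper, simplified)."""
    if size_mb <= 0:
        return 0
    if not splittable:
        # A compressed file without a usable index is read as a single split.
        return 1
    return math.ceil(size_mb / split_mb)


# Sizes of the three data files, in MB, from the question.
sizes_mb = {
    "part-00000.lzo": 192.29,
    "part-00001.lzo": 243.09,
    "part-00002.lzo": 163.99,
}

indexed = sum(estimated_splits(s) for s in sizes_mb.values())
unindexed = sum(estimated_splits(s, splittable=False) for s in sizes_mb.values())
print("splits if the indexes are used:    ", indexed)
print("splits if the indexes are not used:", unindexed)
```

Under these assumptions the estimate is 6 splits when the indexes are honored and 3 when they are not. The mapper count actually observed also depends on how the engine sizes and combines splits on S3, which this sketch does not model.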