I have 3 LZO compressed files and their corresponding index files in HDFS.
Permission Owner Group Size Replication Block Size Name
-rw-r--r-- alum supergroup 0 B 3 128 MB _SUCCESS
-rw-r--r-- alum supergroup 192.29 MB 3 128 MB part-00000.lzo
-rw-r--r-- alum supergroup 89.56 KB 3 128 MB part-00000.lzo.index
-rw-r--r-- alum supergroup 243.09 MB 3 128 MB part-00001.lzo
-rw-r--r-- alum supergroup 106.67 KB 3 128 MB part-00001.lzo.index
-rw-r--r-- alum supergroup 163.99 MB 3 128 MB part-00002.lzo
-rw-r--r-- alum supergroup 70.54 KB 3 128 MB part-00002.lzo.index
We copied these files to Amazon S3 and created a Hive external table over them for analytics.
Here are the problems that we are facing:
1) The LZO index files are also being treated as data files, so meaningless data appears in the Hive table.
2) A "count(*)" query on the table uses only 4 mappers, which suggests a problem with splitting.
Could you please let me know what's going on with S3? The same setup works fine in our YARN cluster on HDFS.
S3 is treated differently from HDFS, so the split logic need not be applied the way it is in HDFS. Remember that S3 is cloud object storage, whereas HDFS is a block-based file system. Your files are not stored in the form of blocks in S3. This behavior is expected.
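For comparison, here is a rough sketch of how many splits block-aligned splitting would give for the three data files on HDFS. It assumes one split per 128 MB block and that the index files make each .lzo file splittable at roughly block granularity; it ignores any split combining Hive may do. The function name `estimate_splits` is made up for illustration, and the sizes are taken from the listing in the question.

```python
import math

BLOCK_SIZE_MB = 128  # block size shown in the HDFS listing

# Data file sizes in MB, from the listing in the question.
# The .lzo.index files are deliberately left out: they are not data.
data_files = {
    "part-00000.lzo": 192.29,
    "part-00001.lzo": 243.09,
    "part-00002.lzo": 163.99,
}

def estimate_splits(size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Assumed model: one input split per (possibly partial) block."""
    return max(1, math.ceil(size_mb / block_size_mb))

per_file = {name: estimate_splits(size) for name, size in data_files.items()}
total = sum(per_file.values())

for name, count in per_file.items():
    print(f"{name}: {count} split(s)")
print(f"total: {total} split(s)")
```

Under these assumptions each file spans two blocks, giving six splits in total, so a mapper count that differs from this on S3 is consistent with the splits not being derived from HDFS-style blocks.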