Why does my LZO indexing take so long on Amazon EMR when reading from S3?


I have a 30 GB LZO file on S3, and I'm using hadoop-lzo to index it on Amazon EMR (AMI v2.4.2) in region us-east-1.

elastic-mapreduce --create --enable-debugging \
    --ami-version "latest" \
    --log-uri s3n://mybucket/mylogs \
    --name "lzo index test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg s3://mybucket/my-30gb-file.lzo \
      --step-name "Index LZO files"

It takes about 10 minutes to reach 1% progress, so a single file would take roughly 16 hours to finish (10 minutes per 1% ≈ 1,000 minutes). So far, progress shows only about 80 MB read.

In contrast, on the same cluster (while the job described above is running), I can copy the file from S3 to the local disk, then to HDFS, and finally run the indexer, all in about 10 minutes. Likewise, my local cluster can process this file in about 7 minutes.
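
For reference, the manual copy-and-index sequence was roughly the following, run from the master node (the local path and the jar location are illustrative, not the exact ones I used):

    # Copy from S3 to the local disk, then into HDFS (illustrative paths)
    hadoop fs -copyToLocal s3n://mybucket/my-30gb-file.lzo /mnt/my-30gb-file.lzo
    hadoop fs -copyFromLocal /mnt/my-30gb-file.lzo hdfs:///my-30gb-file.lzo

    # Run the distributed indexer against the HDFS copy
    hadoop jar hadoop-lzo-0.4.17-SNAPSHOT.jar \
        com.hadoop.compression.lzo.DistributedLzoIndexer \
        hdfs:///my-30gb-file.lzo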

In the past, I believe I've run LZO indexing directly against S3 without this kind of delay, although that was on an earlier AMI version. I don't know which AMI I was using, since I always specify "latest". (Update: I tried AMI v2.2.4 with the same result, so maybe I'm misremembering, or something else is causing the slowness.)

Any ideas what could be going on?

Here's a copy of the Step's log output:

Task Logs: 'attempt_201401011330_0001_m_000000_0'

stdout logs: (empty)

stderr logs: (empty)

syslog logs:

2014-01-01 13:32:39,764 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2014-01-01 13:32:40,043 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2014-01-01 13:32:40,120 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-7-132-249.ec2.internal
2014-01-01 13:32:40,134 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2014-01-01 13:32:40,138 INFO org.apache.hadoop.mapred.Task (main):  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c785f0b
2014-01-01 13:32:40,943 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2014-01-01 13:32:41,104 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2014-01-01 13:32:41,104 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2014-01-01 13:32:41,121 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2014-01-01 13:32:41,121 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2014-01-01 13:32:41,314 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://mybucket/my-30gb-file.lzo' for reading
2014-01-01 13:32:41,478 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '63624'
2014-01-01 13:32:41,773 INFO com.hadoop.mapreduce.LzoIndexRecordWriter (main): Setting up output stream to write index file for s3://mybucket/my-30gb-file.lzo
2014-01-01 13:32:41,885 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index.tmp' but file does not exist, so returning false
2014-01-01 13:32:41,928 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index' but file does not exist, so returning false
2014-01-01 13:32:41,967 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://mybucket/my-30gb-file.lzo.index.tmp' in S3
2014-01-01 13:32:42,017 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '125908'
2014-01-01 13:32:42,227 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '187143'
2014-01-01 13:32:42,516 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '249733'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:34:14,991 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62004474'
2014-01-01 13:34:15,077 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 1000 at pos 61941702 of 39082185217. Read is 0.15865149907767773% done. 
2014-01-01 13:34:15,077 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62067843'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:35:37,849 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '123946504'
2014-01-01 13:35:37,911 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 2000 at pos 123882460 of 39082185217. Read is 0.31714322976768017% done. 
2014-01-01 13:35:37,911 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '124008849'
  ... (repeat of same "Stream for key" message)

My Workaround

FWIW, my workaround is to copy the file to HDFS via distcp (see below), though this slowness still strikes me as something AWS could improve. In the job below, copying from S3 to HDFS takes 17 minutes and the indexing itself takes only 1 minute.

elastic-mapreduce --create --enable-debugging --alive \
    --ami-version "latest" \
    --log-uri s3n://mybucket/logs/dailyUpdater \
    --name "daily updater test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg s3://mybucket/my-30gb-file.lzo \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Upload input file to HDFS" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Index LZO files" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg hdfs:///my-30gb-file.lzo.index \
      --arg s3://mybucket/my-30gb-file.lzo.index \
      --step-name "Upload index to S3"

1 Answer

Best answer (Judge Mental):

Seeking within a stream on S3 is implemented as a GET with a byte-range header. It is eminently reasonable for such a call to take a few hundred milliseconds. Since the indexing process requires a great many seeks, even though they are all in the forward direction, you are effectively re-opening the file thousands of times.
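
To put rough numbers on it, each "seeking to position ..." line in your syslog costs about as much as a fresh ranged GET, something like the sketch below (a hypothetical, unauthenticated curl request purely for illustration; NativeS3FileSystem goes through its own S3 client, and a private bucket would also need authentication):

    # One forward seek ~= re-opening the object with an HTTP Range header
    curl -s -o /dev/null -w "ranged GET took %{time_total}s\n" \
        -H "Range: bytes=62067843-" \
        "https://mybucket.s3.amazonaws.com/my-30gb-file.lzo"

    # Your syslog shows block 1000 at ~0.16% progress, i.e. on the order of
    # 600,000 blocks (and seeks) in total; at ~100 ms per ranged GET that is
    # roughly 16+ hours, which matches the rate you are seeing.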

Your workaround is the correct approach. S3 is optimized for sequential, not random, access.