I'm trying to read a zstd-compressed (.zst) file with Spark in Scala.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schema = new StructType()
  .add("title", StringType, true)
  .add("selftext", StringType, true)
  .add("score", LongType, true)
  .add("created_utc", LongType, true)
  .add("subreddit", StringType, true)
  .add("author", StringType, true)
val df_with_schema = spark.read.schema(schema).json("/home/user/repos/concepts/abcde/RS_2019-09.zst")
df_with_schema.take(1)
Unfortunately this produces the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.0.101 executor driver): java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.
The output of hadoop checknative is shown below and reports zstd as available. However, I understand from here that Apache Spark has its own ZStandardCodec.
Native library checking:
- hadoop: true /opt/hadoop/lib/native/libhadoop.so.1.0.0
- zlib: true /lib/x86_64-linux-gnu/libz.so.1
- zstd : true /lib/x86_64-linux-gnu/libzstd.so.1
- snappy: true /lib/x86_64-linux-gnu/libsnappy.so.1
- lz4: true revision:10301
- bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
- openssl: false EVP_CIPHER_CTX_cleanup
- ISA-L: false libhadoop was built without ISA-L support
- PMDK: false The native code was built without PMDK support.
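One thing worth checking: hadoop checknative reports what the hadoop command's own JVM can load, which is not necessarily what Spark's JVM sees. The exception above comes from Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which handles compressed input files; Spark's own zstd codec (org.apache.spark.io.ZStdCompressionCodec) appears to cover only internal data such as shuffle and broadcast blocks. A minimal diagnostic sketch to paste into spark-shell, assuming the Hadoop version on Spark's classpath ships ZStandardCodec (Hadoop 2.9+ or 3.x):

```scala
import org.apache.hadoop.io.compress.ZStandardCodec
import org.apache.hadoop.util.NativeCodeLoader

// Directories this (driver) JVM searches for native libraries such as libhadoop.so.
println(s"java.library.path = ${System.getProperty("java.library.path")}")

// True only if libhadoop was found and loaded by this JVM.
val hadoopNativeLoaded = NativeCodeLoader.isNativeCodeLoaded()
println(s"libhadoop loaded: $hadoopNativeLoaded")

// True only if libhadoop is loaded and its zstd bindings initialized successfully.
val zstdNativeLoaded = ZStandardCodec.isNativeCodeLoaded()
println(s"native zstd usable: $zstdNativeLoaded")
```

If the first flag is false, libhadoop.so is not on Spark's library path at all, and the same "built without zstd support" message seems to be raised in that case too. Note that this only inspects the driver JVM; in local mode (as in the stack trace above, "executor driver") that is also where the tasks run.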
Any ideas are appreciated, thank you!
UPDATE 1: Thanks to this post, I now understand the message better: zstd support is not enabled by default when compiling Hadoop, so one possible solution would be to build Hadoop with that flag enabled.
Since I didn't want to build Hadoop myself, I followed the workaround used here and configured Spark to use the Hadoop native libraries:
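A minimal sketch of what such a configuration can look like, not necessarily the exact setup used here. It assumes the native libraries live in /opt/hadoop/lib/native (the directory shown in the checknative output above) and that the settings go into conf/spark-defaults.conf; the two property names are standard Spark settings, while the path is machine-specific:

```properties
# conf/spark-defaults.conf
# Assumption: libhadoop.so is in /opt/hadoop/lib/native; adjust to your installation.
spark.driver.extraLibraryPath    /opt/hadoop/lib/native
spark.executor.extraLibraryPath  /opt/hadoop/lib/native
```

The driver setting has to be in place before the driver JVM starts, so in client mode it belongs in the properties file or on the command line (for example via --driver-library-path) rather than in a SparkConf built inside the application.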
I can now read the zst archive into a DataFrame with no issues.