I have two datasets: `Dataset[User]` and `Dataset[Book]`, where both `User` and `Book` are case classes. I join them like this:

```scala
val joinDS = ds1.join(ds2, "userid")
```
If I try to `map` over each element in `joinDS`, the compiler complains that an encoder is missing:

```
not enough arguments for method map: (implicit evidence$46: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$46.
Unable to find encoder for type stored in a Dataset.
```
But the same error does not occur if I use `foreach` instead of `map`. Why doesn't `foreach` require an encoder as well? I have already imported all implicits from the Spark session, so why does `map` require an encoder at all when the dataset is the result of joining two datasets containing case classes? Also, what type of dataset do I get from that join? Is it a `Dataset[Row]`, or something else?
TL;DR

An `Encoder` is required to transform the outcome to the internal Spark SQL format, and there is no need for that in the case of `foreach` (or any other sink).

Just take a look at the signatures:
`map` is
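```scala
def map[U](func: (T) => U)(implicit arg0: Encoder[U]): Dataset[U]
```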
so in plain words it transforms records from `T` to `U` and then uses the `Encoder` of `U` to transform the result to the internal representation.

`foreach`, on the other hand, is
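```scala
def foreach(f: (T) => Unit): Unit
```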
In other words, it doesn't expect any result. Since there is no result to be stored, an `Encoder` is simply not needed.
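As for the last part of the question: a join on a column-name string returns an untyped `DataFrame`, i.e. `Dataset[Row]`. Here is a minimal runnable sketch of the whole situation; the `User`/`Book` field names, the `JoinEncoderDemo` object, and the local `SparkSession` setup are my own assumptions, not taken from your code:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes standing in for the question's User and Book.
case class User(userid: Long, name: String)
case class Book(userid: Long, title: String)

object JoinEncoderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-encoder-demo")
      .getOrCreate()
    import spark.implicits._ // provides Encoders for case classes, primitives, etc.

    val ds1 = Seq(User(1L, "Ann"), User(2L, "Bob")).toDS()
    val ds2 = Seq(Book(1L, "Dune"), Book(2L, "Emma")).toDS()

    // Joining on a column name returns an untyped DataFrame, i.e. Dataset[Row].
    val joinDS = ds1.join(ds2, "userid")

    // foreach is a sink: it produces no Dataset, so no Encoder is required.
    joinDS.foreach(row => println(row))

    // map produces a Dataset[U] and therefore needs an Encoder[U] in scope.
    // Encoder[String] comes from spark.implicits._; mapping to Unit would fail,
    // since there is no Encoder[Unit] -- exactly the error in the question.
    val titles = joinDS.map(_.getAs[String]("title"))
    titles.show()

    spark.stop()
  }
}
```

If you want to keep a typed result instead of working with `Row`, you can map the rows back to a case class (with its encoder in scope) or use `joinWith`, which yields a `Dataset` of pairs and preserves the element types.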