In Spark SQL, there are a limited number of `DataType`s for a `Schema`, and a limited number of `Encoder`s for converting JVM objects to and from the internal Spark SQL representation.
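
To make the two mechanisms concrete, here is a minimal sketch of how I understand them (the names `Person` and `TypesVsEncoders` are mine, assuming a local Spark 2.x+ `SparkSession`): the `DataFrame` side builds its schema from the fixed set of built-in `DataType`s, while the `Dataset[T]` side resolves an implicit `Encoder[T]` from `spark.implicits._`.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Case class at top level so Spark can derive its Product encoder via reflection.
case class Person(id: Int, name: String)

object TypesVsEncoders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Schema side: a DataFrame schema is assembled from the fixed set of built-in DataTypes.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))), schema)

    // Encoder side: a Dataset[T] needs an implicit Encoder[T]; primitive types and
    // Product types (case classes) get one from spark.implicits._.
    val ds = Seq(Person(1, "a"), Person(2, "b")).toDS()

    df.printSchema()  // id: integer, name: string -- built from DataTypes
    ds.printSchema()  // same shape, but derived from Encoder[Person]
    spark.stop()
  }
}
```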
- In practice, we may have errors like the following regarding `DataType`, which usually happen in a `DataFrame` with custom types, BUT NOT in a `Dataset[T]` with custom types. Discussion of `DataType` (or `UDT`) points to How to define schema for custom type in Spark SQL?

      java.lang.UnsupportedOperationException: Schema for type xxx is not supported
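
  For example, a minimal way I can reproduce this kind of error (class and object names are mine; the exact behavior may vary by Spark version) is to push a class that is neither a built-in type nor a `Product` through an API that derives a schema by reflection, such as a UDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// A plain class: not a primitive, not a Product (case class), and no UDT registered for it.
class MyPoint(val x: Double, val y: Double) extends Serializable

object SchemaErrorRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Defining the UDF makes Spark derive a DataType for the return type by reflection
    // (ScalaReflection.schemaFor), which has no mapping for MyPoint, so this line throws:
    //   java.lang.UnsupportedOperationException: Schema for type MyPoint is not supported
    val makePoint = udf((x: Double, y: Double) => new MyPoint(x, y))

    spark.stop()
  }
}
```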
- In practice, we may have errors like the following regarding `Encoder`, which usually happen in a `Dataset[T]` with custom types, BUT NOT in a `DataFrame` with custom types. Discussion of `Encoder` points to How to store custom objects in Dataset?

      Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
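
  And the complementary reproduction, again only a sketch with my own names: putting the same plain class into a `Dataset[T]` fails because no implicit `Encoder[MyPoint]` can be found; the commonly cited workaround is a generic binary encoder such as `Encoders.kryo`:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// The same plain class as above: no implicit Encoder is available for it.
class MyPoint(val x: Double, val y: Double) extends Serializable

object EncoderErrorRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Does not compile -- "Unable to find encoder for type stored in a Dataset":
    // val broken = Seq(new MyPoint(1.0, 2.0)).toDS()

    // Workaround: an opaque Kryo-based encoder that serializes each object into a
    // single binary column, giving up a real columnar schema for MyPoint.
    implicit val myPointEncoder: Encoder[MyPoint] = Encoders.kryo[MyPoint]
    val ds = spark.createDataset(Seq(new MyPoint(1.0, 2.0)))
    ds.printSchema()  // root |-- value: binary (nullable = true)

    spark.stop()
  }
}
```

  The Kryo fallback does give me a `Dataset`, but the schema collapses to a single binary column, which is part of what I am trying to understand below.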
In my understanding, both touch the internal Spark SQL optimizer (which is why only a limited number of `DataType`s and `Encoder`s are provided); and since a `DataFrame` is just a `Dataset[Row]`, both are ultimately a `Dataset[A]`, then..
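
For reference, the equivalence I am leaning on here is, as far as I can tell, literally a type alias in Spark 2.x+ (`type DataFrame = Dataset[Row]` in the `org.apache.spark.sql` package object); a small sketch of what that means in practice (names are mine):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DataFrameIsDatasetOfRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // DataFrame is a type alias for Dataset[Row], so this assignment compiles as-is.
    val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val rows: Dataset[Row] = df

    // The difference is the element type: Row is untyped (checked at runtime against
    // the schema of DataTypes), while Dataset[T] carries an Encoder[T] for a typed view.
    val typed: Dataset[(Int, String)] = df.as[(Int, String)]

    rows.printSchema()   // id: integer, name: string
    typed.printSchema()  // same schema; only the JVM-side view changed
    spark.stop()
  }
}
```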
Question (or more.. confusion)
1. Why does the first error appear only in a `DataFrame` but not in a `Dataset[T]`? And the same question for the second error...
2. Can creating a `UDT` solve the 2nd error? Can creating encoders solve the 1st error?
3. How should I understand the relation between each, and how do they interact with `Dataset` or the Spark SQL engine?
The motivation for this post is to explore these two concepts further and to invite open discussion, so please bear with me if the questions are not very specific.. any sharing of understanding is appreciated. Thanks.