DataType (UDT) vs. Encoder in Spark SQL

In Spark SQL, there is a limited set of DataTypes for defining a schema, and a limited set of Encoders for converting JVM objects to and from the internal Spark SQL representation.
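
To keep the two mechanisms apart, here is a minimal side-by-side sketch (my own illustration, assuming Spark 2.x): built-in DataTypes compose into an explicit schema, while Encoders describe how a JVM type maps to the internal representation. Encoders.kryo is shown only as one generic fallback for a type with no built-in encoder.

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{Encoder, Encoders}

    object TwoMechanisms {
      // DataTypes: compose built-in types into an explicit schema (used by DataFrames).
      val schema: StructType = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType)
      ))

      // Encoders: describe how a JVM type is converted to/from Spark's internal rows.
      val intEncoder: Encoder[Int] = Encoders.scalaInt
      // For a type with no built-in encoder, Encoders.kryo is one generic fallback.
      val uuidEncoder: Encoder[java.util.UUID] = Encoders.kryo[java.util.UUID]
    }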

  • In practice, we may hit an error like the one below regarding DataType. It usually happens with a DataFrame holding custom types, BUT NOT with a Dataset[T] holding custom types (a repro sketch follows this list). Discussion of DataType (or UDT) points to How to define schema for custom type in Spark SQL?

java.lang.UnsupportedOperationException: Schema for type xxx is not supported

  • In practice, we may hit an error like the one below regarding Encoder. It usually happens with a Dataset[T] holding custom types, BUT NOT with a DataFrame holding custom types. Discussion of Encoder points to How to store custom objects in Dataset?

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
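
To make the two failure modes concrete, here is a minimal repro sketch (my own illustration; MyCustom and Wrapper are hypothetical names, and the exact error messages vary by Spark version):

    import org.apache.spark.sql.SparkSession

    // Hypothetical custom type: no built-in DataType mapping and no implicit Encoder.
    class MyCustom(val value: Int)

    case class Wrapper(id: Int, payload: MyCustom)

    object ReproSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
        import spark.implicits._

        // (1) DataFrame path: schema derivation for the MyCustom field fails at runtime,
        //     typically with "Schema for type MyCustom is not supported".
        val df = spark.createDataFrame(Seq(Wrapper(1, new MyCustom(42))))

        // (2) Dataset path: even with the built-in implicits imported, there is no
        //     implicit Encoder[MyCustom], so this line does not even compile --
        //     "Unable to find encoder for type stored in a Dataset".
        // val ds = spark.createDataset(Seq(new MyCustom(42)))

        spark.stop()
      }
    }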

In my understanding, both touch the internal Spark SQL optimizer (which is why only a limited number of DataTypes and Encoders are provided), and both DataFrame and Dataset are really just Dataset[A] (see the alias sketch below), so then..
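
For reference, in Spark 2.x a DataFrame is literally the type alias type DataFrame = Dataset[Row] defined in the org.apache.spark.sql package object, i.e. a Dataset whose rows are described by an explicit schema rather than a typed encoder for T. A tiny sketch of my own (assuming Spark 2.x in local mode) that type-checks:

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
    import org.apache.spark.sql.types._

    object AliasSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("alias").getOrCreate()

        val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
        val rows = spark.sparkContext.parallelize(Seq(Row(1), Row(2)))

        val df: DataFrame = spark.createDataFrame(rows, schema)
        val ds: Dataset[Row] = df   // compiles as-is: a DataFrame is just a Dataset[Row]

        ds.show()
        spark.stop()
      }
    }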

Question (or more.. confusion)

  • Why does the first error appear only with a DataFrame but not with a Dataset[T]? Same question, reversed, for the second error...

  • Can creating a UDT solve the 2nd error? Can creating Encoders solve the 1st error?

  • How should I understand the relationship between the two, and how do they interact with Dataset or the Spark SQL engine?

The intent of this post is to explore the two concepts further and to invite open discussion, so please bear with me if the questions are not too specific.. any sharing of understanding is appreciated. Thanks.
