I recently moved from Spark 1.6 to Spark 2.x and I would like to move - where possible - from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a: Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, a String, an Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:
java.lang.ClassNotFoundException: scala.Any
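For concreteness, here is a minimal self-contained version of what I am running (local mode, with a single Long column named "a" just as an example):

import org.apache.spark.sql.SparkSession

case class MyClass(a: Any)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(5).toDF("a")
// Compiles, but fails at run time while Spark derives an Encoder for
// MyClass, with java.lang.ClassNotFoundException: scala.Any
val ds = df.map(x => MyClass(x.get(0)))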
With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass declares a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo, you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
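For completeness, here is a minimal sketch of that workaround, assuming an existing SparkSession named spark; the FooBar class is a hypothetical example. Note that Kryo serializes each object into a single opaque binary column, so you lose the columnar layout and cannot address individual fields with Dataset operators:

import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical class with an Any field; no implicit Encoder can be derived for it
case class FooBar(foo: Int, bar: Any)

// Kryo-based encoder: each object is stored as one serialized binary blob
implicit val fooBarEncoder: Encoder[FooBar] = Encoders.kryo[FooBar]

val ds = spark.createDataset(Seq(FooBar(1, "a"), FooBar(2, 42L)))
ds.printSchema()  // a single binary column named "value"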
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.

If for a given execution all values are of the same type, you can of course create specialized classes and decide at run time which one to use, as sketched below.
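A minimal sketch of that approach, assuming the runtime type can be read off the DataFrame schema; MyLongClass, MyStringClass and toDataset are hypothetical names:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.{LongType, StringType}

// One specialized class per type the column may have at run time
case class MyLongClass(a: Long)
case class MyStringClass(a: String)

// Inspect the schema once and pick the matching specialized class
def toDataset(df: DataFrame): Dataset[_] = {
  import df.sparkSession.implicits._
  df.schema.head.dataType match {
    case LongType   => df.map(row => MyLongClass(row.getLong(0)))
    case StringType => df.map(row => MyStringClass(row.getString(0)))
    case other      => sys.error(s"No specialized class for type $other")
  }
}

Each branch gets a regular product Encoder derived from its specialized case class, so the resulting Dataset keeps its columnar representation, unlike the Kryo workaround above.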