I recently moved from Spark 1.6 to Spark 2.x and, where possible, I would like to move from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a : Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, because I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, a String, an Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:
java.lang.ClassNotFoundException: scala.Any
With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo or Encoders.javaSerialization (both sketched below), you don't.
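For illustration, a minimal sketch of both (the FooBar class and its values are made up for the example; spark is the usual SparkSession):

import org.apache.spark.sql.Encoders

case class FooBar(foo: Int, bar: Any)

// Kryo-serialize each object into a single binary column. This compiles
// even with an Any field, but you get an opaque blob: no named columns,
// no columnar storage, no Catalyst optimizations.
spark.createDataset(
  Seq(FooBar(1, "a"))
)(Encoders.kryo[FooBar])

// The same idea with plain Java serialization, usually slower and bulkier.
spark.createDataset(
  Seq(FooBar(1, "a"))
)(Encoders.javaSerialization[FooBar])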
All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.

The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.

If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use, as in the sketch below.
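For example (a sketch only; the class names and the single-column schema check are my own assumptions, with df being the DataFrame from the question and its first column being either a Long or a String):

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.{LongType, StringType}
import spark.implicits._  // spark is the SparkSession; provides implicit Encoders

// One specialized case class per concrete type, each with a statically
// known field type for which an implicit Encoder can be derived.
case class MyLongClass(a: Long)
case class MyStringClass(a: String)

// Inspect the runtime schema once and branch to the matching class.
df.schema.fields.head.dataType match {
  case LongType =>
    val ds: Dataset[MyLongClass] = df.map(x => MyLongClass(x.getLong(0)))
    // ... continue with a properly typed Dataset
  case StringType =>
    val ds: Dataset[MyStringClass] = df.map(x => MyStringClass(x.getString(0)))
    // ... continue with a properly typed Dataset
}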