I want to sort, by timestamp, some Avro files that I retrieve from HDFS.
The schema of my Avro files is:
headers : Map[String,String], body : String
Now the tricky part is that the timestamp is one of the key/value pairs of the map. So the map contains the timestamp like this:
key_1 -> value_1, key_2 -> value_2, timestamp -> 1234567, key_n -> value_n
Note that the values are all of type String.
I created a case class matching this schema to build my Dataset:
case class Root(headers : Map[String,String], body: String)
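To make the intended ordering concrete, here is a plain-Scala sketch (outside Spark, with hypothetical sample records) of sorting such case-class instances by the map's timestamp entry. Note that the string value has to be converted to a number, otherwise a lexicographic sort would misorder e.g. "9" vs "10":

```scala
case class Root(headers: Map[String, String], body: String)

object LocalSortSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample records; timestamps are stored as strings in the map
    val records = Seq(
      Root(Map("key_1" -> "a", "timestamp" -> "300"), "body3"),
      Root(Map("key_1" -> "b", "timestamp" -> "100"), "body1"),
      Root(Map("key_1" -> "c", "timestamp" -> "200"), "body2")
    )

    // Look up the "timestamp" entry in each map and sort numerically
    val sorted = records.sortBy(_.headers("timestamp").toLong)

    println(sorted.map(_.body).mkString(","))  // body1,body2,body3
  }
}
```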
Creation of my Dataset:
val ds = spark
  .read
  .format("com.databricks.spark.avro")
  .load(pathToHDFS)
  .as[Root]
I don't really know how to begin with this problem, since I can only access the columns headers and body. How can I reach the nested values so I can finally sort by timestamp?
I would like to keep the whole dataset and just reorder it by timestamp. A little precision: I don't want to lose any data from my initial dataset, I only want a sorting operation.
I use Spark 2.3.0.
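For reference, one way this could be approached in Spark 2.3 (a sketch, not a definitive answer: the helper column name `ts` and the local SparkSession setup are assumptions) is to pull the map entry out with `getItem`, cast it, sort on it, and then drop the helper column so the dataset keeps exactly its original schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class Root(headers: Map[String, String], body: String)

object SortByTimestamp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-by-timestamp")
      .getOrCreate()
    import spark.implicits._

    // In the real job this would come from
    // spark.read.format("com.databricks.spark.avro").load(pathToHDFS).as[Root];
    // hypothetical in-memory data is used here instead
    val ds = Seq(
      Root(Map("key_1" -> "a", "timestamp" -> "300"), "body3"),
      Root(Map("key_1" -> "b", "timestamp" -> "100"), "body1"),
      Root(Map("key_1" -> "c", "timestamp" -> "200"), "body2")
    ).toDS()

    // Extract the "timestamp" entry from the map column, cast the string
    // to long for a numeric sort, order by it, then drop the helper column
    // so no data is added or lost
    val sorted = ds
      .withColumn("ts", col("headers").getItem("timestamp").cast("long"))
      .orderBy("ts")
      .drop("ts")
      .as[Root]

    sorted.show(false)
    spark.stop()
  }
}
```

The `withColumn`/`drop` pair keeps the result assignable back to `Dataset[Root]`, since after dropping `ts` the schema again matches the case class exactly.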