I want to order by timestamp some avro files that I retrieve from HDFS.
The schema of my Avro files is:
headers: Map[String,String], body: String
Now the tricky part is that the timestamp is one of the key/value pairs in the map. So the timestamp is contained in the map like this:
key_1 -> value_1, key_2 -> value_2, timestamp -> 1234567, key_n -> value_n
Note that the type of the values is String.
I created a case class to build my dataset with this schema:
case class Root(headers : Map[String,String], body: String)
Creation of my dataset :
val ds = spark
.read
.format("com.databricks.spark.avro")
.load(pathToHDFS)
.as[Root]
I don't really know how to begin with this problem since I can only access the headers and body columns. How can I get at the nested values so that I can finally sort by timestamp?
I would like to do something like this:
ds.select("headers").doSomethingToGetTheMapStructure.doSomeConversionStringToTimeStampForTheColumnTimeStamp("timestamp").orderBy("timestamp")
To be clear: I don't want to lose any data from my initial dataset; I just need a sorting operation.
I use Spark 2.3.0.
Given the loaded Dataset[Root], you can simply look up the Map by the timestamp key, cast the value to Long, and perform an orderBy. Note that $"headers"("timestamp") is just the same as using the apply column method (i.e. $"headers".apply("timestamp")). Alternatively, you could also use getItem to access the Map by key.
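A minimal sketch of both variants, assuming a local SparkSession, the Root case class and pathToHDFS from the question, and the spark-avro package on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration; in your job, reuse your existing one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.read
  .format("com.databricks.spark.avro")
  .load(pathToHDFS)
  .as[Root]

// Look the timestamp value up in the headers map, cast the String to long,
// and sort on it. All original columns (headers, body) are preserved,
// since orderBy only reorders rows and returns a Dataset[Root].
val sorted = ds.orderBy($"headers"("timestamp").cast("long"))

// Equivalent alternative using getItem instead of apply:
val sortedAlt = ds.orderBy($"headers".getItem("timestamp").cast("long"))
```

Both expressions build the same column; pick whichever reads better to you. Because the sort key is only used inside orderBy and never selected, no data is dropped from the dataset.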