Below is my dataset.
user,device,time_spent,video_start
userA,mob,5,1
userA,desk,5,2
userA,desk,5,3
userA,mob,5,2
userA,mob,5,2
userB,desk,5,2
userB,mob,5,2
userB,mob,5,2
userB,desk,5,2
I want to compute the following aggregation for each user:
user total_time_spent device_distribution
userA 25 {mob:60%,desk:40%}
userB 20 {mob:50%,desk:50%}
Can someone help me achieve this using the Spark 2.0 API, preferably in Java? I have tried using UserDefinedAggregateFunction, but it doesn't support a group within a group: I have to group each user's rows by device to find the aggregated time spent on each device.
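To pin down the expected numbers, the aggregation can be sketched in plain Java (no Spark); the class and method names here are made up for illustration. Note that the sample data above actually sums to 25 for userA (3 mob rows + 2 desk rows at 5 each), which is what produces the 60%/40% split:

```java
import java.util.*;

public class DeviceDistributionSketch {
    // Returns, per user: total time spent plus per-device percentage shares.
    static Map<String, Map<String, Object>> aggregate(List<String[]> rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        Map<String, Map<String, Integer>> perDevice = new LinkedHashMap<>();
        for (String[] r : rows) {               // r = {user, device, timeSpent}
            String user = r[0], device = r[1];
            int t = Integer.parseInt(r[2]);
            totals.merge(user, t, Integer::sum);
            perDevice.computeIfAbsent(user, u -> new LinkedHashMap<>())
                     .merge(device, t, Integer::sum);
        }
        Map<String, Map<String, Object>> out = new LinkedHashMap<>();
        for (String user : totals.keySet()) {
            int total = totals.get(user);
            Map<String, String> dist = new LinkedHashMap<>();
            perDevice.get(user).forEach((device, t) ->
                    dist.put(device, Math.round(100.0 * t / total) + "%"));
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("total_time_spent", total);
            row.put("device_distribution", dist);
            out.put(user, row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"userA", "mob", "5"},
                new String[]{"userA", "desk", "5"},
                new String[]{"userA", "desk", "5"},
                new String[]{"userA", "mob", "5"},
                new String[]{"userA", "mob", "5"},
                new String[]{"userB", "desk", "5"},
                new String[]{"userB", "mob", "5"},
                new String[]{"userB", "mob", "5"},
                new String[]{"userB", "desk", "5"});
        System.out.println(aggregate(rows));
    }
}
```

This gives userA a total of 25 with {mob=60%, desk=40%} and userB a total of 20 with {desk=50%, mob=50%}.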
Here the `pivot` function is pretty useful. There is an article from Databricks on the subject. For the code, sorry it's Scala, but it shouldn't be a big problem to translate it to Java.

NB: with the `pivot` function you need an aggregation function. Here, since there is only one value per device after the first grouping, you can simply use `first`.

The `device_distribution` column format isn't exactly what you're looking for, but if you map each row to a `case class` and then save your output data in JSON format, for instance, it will have exactly the format you want.
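The answer's original Scala snippet isn't shown above. A hedged sketch of the pivot approach it describes, translated to Java as the question prefers, might look like the following; it is untested here and assumes Spark 2.x on the classpath and a hypothetical `input.csv` path:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class DeviceDistribution {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("device-distribution")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("input.csv"); // hypothetical path to the dataset above

        // 1) time spent per (user, device)
        Dataset<Row> perDevice = df.groupBy("user", "device")
                .agg(sum("time_spent").as("device_time"));

        // 2) total time per user
        Dataset<Row> totals = df.groupBy("user")
                .agg(sum("time_spent").as("total_time_spent"));

        // 3) join, compute each device's share of the total, then pivot
        //    devices into columns; `first` is enough as the aggregation
        //    because after step 1 there is exactly one share value per
        //    (user, device) pair
        Dataset<Row> result = perDevice.join(totals, "user")
                .withColumn("share", round(col("device_time")
                        .divide(col("total_time_spent")).multiply(100)))
                .groupBy("user", "total_time_spent")
                .pivot("device")
                .agg(first("share"));

        result.show();
        // result.write().json("output");  // one JSON object per user
        spark.stop();
    }
}
```

As the answer notes, the pivoted output puts each device's share in its own flat column; to get the nested `device_distribution` object in the JSON, you would map each row to a small bean (the Java analogue of the Scala `case class`) before writing.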