Below is my dataset.
user,device,time_spent,video_start
userA,mob,5,1
userA,desk,5,2
userA,desk,5,3
userA,mob,5,2
userA,mob,5,2
userB,desk,5,2
userB,mob,5,2
userB,mob,5,2
userB,desk,5,2
I want to find out below aggregation for each user.
user total_time_spent device_distribution
userA 25 {mob:60%,desk:40%}
userB 20 {mob:50%,desk:50%}
Can someone help me achieve this using the Spark 2.0 API, preferably in Java? I have tried using UserDefinedAggregateFunction, but it doesn't support a group within a group: I have to group each user's rows by device to find the aggregated time spent on each device.
Here the `pivot` function is pretty useful. There is an article from Databricks on the subject. For the code (sorry, it's Scala, but that shouldn't be a big problem to translate to Java):
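Something along these lines should work (a sketch, not a verbatim solution: the variable names are mine, and the question's CSV is inlined as a `Seq` for readability):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("device-distribution").getOrCreate()
import spark.implicits._

// The dataset from the question, inlined as a DataFrame
val df = Seq(
  ("userA", "mob", 5, 1), ("userA", "desk", 5, 2), ("userA", "desk", 5, 3),
  ("userA", "mob", 5, 2), ("userA", "mob", 5, 2),
  ("userB", "desk", 5, 2), ("userB", "mob", 5, 2),
  ("userB", "mob", 5, 2), ("userB", "desk", 5, 2)
).toDF("user", "device", "time_spent", "video_start")

// Step 1: total time per (user, device) -- the "group within group"
val perDevice = df.groupBy($"user", $"device")
  .agg(sum($"time_spent").as("device_time"))

// Step 2: pivot the devices into columns, one row per user
val pivoted = perDevice.groupBy($"user")
  .pivot("device")
  .agg(first($"device_time"))

// Step 3: total time spent and per-device percentages
val result = pivoted
  .withColumn("total_time_spent", $"desk" + $"mob")
  .withColumn("desk", concat(($"desk" * 100 / $"total_time_spent").cast("int").cast("string"), lit("%")))
  .withColumn("mob", concat(($"mob" * 100 / $"total_time_spent").cast("int").cast("string"), lit("%")))

result.show()
// +-----+----+---+----------------+
// | user|desk|mob|total_time_spent|
// +-----+----+---+----------------+
// |userA| 40%|60%|              25|
// |userB| 50%|50%|              20|
// +-----+----+---+----------------+
```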
NB: with the `pivot` function you need an aggregation function. Here, since the first grouping leaves exactly one value per device, you can simply use `first`.
The `device_distribution` column format isn't exactly what you're looking for yet, but with a `case class`, when saving your output data in JSON format for instance, it will have exactly the format you want:
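A sketch of that last step (the case class and field names here are illustrative, and the output path is a placeholder):

```scala
// Case classes matching the desired nested JSON shape
case class DeviceDistribution(mob: String, desk: String)
case class UserStats(user: String, total_time_spent: Long, device_distribution: DeviceDistribution)

// Map each pivoted row to the typed structure
val typed = result.map { row =>
  UserStats(
    row.getAs[String]("user"),
    row.getAs[Long]("total_time_spent"),
    DeviceDistribution(row.getAs[String]("mob"), row.getAs[String]("desk"))
  )
}

typed.write.json("device_distribution_output")  // placeholder path
// Each line of the JSON output then looks like:
// {"user":"userA","total_time_spent":25,"device_distribution":{"mob":"60%","desk":"40%"}}
// {"user":"userB","total_time_spent":20,"device_distribution":{"mob":"50%","desk":"50%"}}
```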