How to apply custom logic to a Spark DataFrame using Scala


Imagine the data for this question is in a nested JSON structure. I have flattened the data from the JSON using explode() and loaded it into one DataFrame with the columns project, Task, Task-Evidence, Task-Remarks, Project-Evidence.

*Note: This DataFrame has 1 project with 2 tasks; the first task has 1 task-link and the second task has 1 task-link. At the project level there are 3 project-links.

Result of DF

Expected Result
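To make the setup concrete, here is a minimal sketch of what such a flattened DataFrame might look like; the project, task, and link values are made up for illustration and are not from the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flatten-example")
  .getOrCreate()
import spark.implicits._

// Hypothetical exploded rows: one project, two tasks (one task-link each),
// and three project-level links repeated across the rows.
val df = Seq(
  ("Project-A", "Task-1", "task-link-1", "remark-1", "proj-link-1"),
  ("Project-A", "Task-1", "task-link-1", "remark-1", "proj-link-2"),
  ("Project-A", "Task-1", "task-link-1", "remark-1", "proj-link-3"),
  ("Project-A", "Task-2", "task-link-2", "remark-2", "proj-link-1"),
  ("Project-A", "Task-2", "task-link-2", "remark-2", "proj-link-2"),
  ("Project-A", "Task-2", "task-link-2", "remark-2", "proj-link-3")
).toDF("project", "Task", "Task-Evidence", "Task-Remarks", "Project-Evidence")

df.show(truncate = false)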


1 Answer

Islam Elbanna

AFAIU, if you have flattened the JSON, then you just need to group the tasks, task evidences, etc. by project. So you can group by project and use collect_set, something like this:

import org.apache.spark.sql.functions._

// Collapse the exploded rows back to one row per project,
// collecting the distinct values of each column into an array.
val df2 = df.groupBy("project").agg(
  collect_set("Task").as("Tasks"),
  collect_set("Task-Evidence").as("Task-Evidences"),
  collect_set("Task-Remarks").as("Task-Remarks"),
  collect_set("Project-Evidence").as("Project-Evidences")
)
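With the hypothetical sample data sketched above, this would collapse the six exploded rows back into a single row per project, roughly like this (collect_set removes duplicates and does not guarantee the order of elements inside the arrays):

df2.show(truncate = false)
// project:           Project-A
// Tasks:             [Task-1, Task-2]
// Task-Evidences:    [task-link-1, task-link-2]
// Task-Remarks:      [remark-1, remark-2]
// Project-Evidences: [proj-link-1, proj-link-2, proj-link-3]

If duplicate values matter, collect_list can be used instead of collect_set; it keeps duplicates, though the ordering of the collected elements is still not deterministic after a shuffle.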