How can you apply filter for a RelationalGroupedDataset class from apache.spark.sql using Scala?


I was trying to find a filter function: one that takes a list and a predicate (a function from the list's element type to Boolean), and returns only those elements for which the predicate is true.

When I try to apply filter, I get an error. Are there any ways to apply filter to a RelationalGroupedDataset? (I wasn't able to find any in the attached docs: https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/RelationalGroupedDataset.html)

Also, is there proper notation for how I should be accessing a specific column value for a RelationalGroupedDataset?

Thanks!

(The original call and the error message were attached as screenshots.)


There are 2 answers

thebluephantom:

Here is an example:

import org.apache.spark.sql.functions.{sum, avg, max, col}

df.groupBy("department")              // returns a RelationalGroupedDataset
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))     // agg turns it back into a DataFrame
  .where(col("sum_bonus") >= 50000)   // so the filter applies to a DataFrame
  .show(false)

It should give you guidance.
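The key point is that RelationalGroupedDataset only exposes aggregation methods; there is no filter on it. You filter the DataFrame before groupBy, or the DataFrame that agg returns. A minimal runnable sketch of that pattern, assuming a local SparkSession and a small made-up employees dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, col}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, purely for illustration.
val df = Seq(
  ("sales", 1000), ("sales", 2000),
  ("hr",    500),  ("hr",    300)
).toDF("department", "bonus")

// groupBy returns a RelationalGroupedDataset, which has no filter;
// agg returns a plain DataFrame, which does.
val result = df.groupBy("department")
  .agg(sum("bonus").as("sum_bonus"))
  .where(col("sum_bonus") >= 1000)

result.show()
```

Here only the "sales" row survives, since its bonus total (3000) clears the threshold while "hr" (800) does not.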

Giri:

Try adding `: _*` to the columns passed into groupBy:

import org.apache.spark.sql.{Column, DataFrame}

def showGroupByDesc(df: DataFrame, cols: Column*): Unit = {
  // `cols: _*` expands the varargs sequence back into individual arguments.
  // The $"count" interpolator requires `import spark.implicits._` in scope.
  df.groupBy(cols: _*).count().sort($"count".desc).show()
}

`: _*` is special Scala syntax for expanding a sequence into the individual arguments of a varargs function.

Without `: _*`, the compiler looks for an overload of groupBy that accepts a single Seq[Column], and no such overload exists.

You can read more about varargs (repeated parameters) in the Scala language documentation.
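To see the `: _*` mechanics outside of Spark, here is a plain-Scala sketch (the function name is made up for illustration):

```scala
// A varargs function: accepts any number of Int arguments.
def total(xs: Int*): Int = xs.sum

val nums = Seq(1, 2, 3)

// total(nums)          // does not compile: no overload takes a Seq[Int]
val ok = total(nums: _*) // `: _*` expands the Seq into individual arguments

println(ok) // prints 6
```

The same expansion is what lets a `Column*` parameter like the one in showGroupByDesc be forwarded into groupBy.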