I have a PySpark DataFrame which looks like this:
df = spark.createDataFrame(
    data=[
        (1, "GERMANY", "20230606", True),
        (2, "GERMANY", "20230620", False),
        (3, "GERMANY", "20230627", True),
        (4, "GERMANY", "20230705", True),
        (5, "GERMANY", "20230714", False),
        (6, "GERMANY", "20230715", True),
    ],
    schema=["ID", "COUNTRY", "DATE", "FLAG"],
)
df.show()
+---+-------+--------+-----+
| ID|COUNTRY| DATE| FLAG|
+---+-------+--------+-----+
| 1|GERMANY|20230606| true|
| 2|GERMANY|20230620|false|
| 3|GERMANY|20230627| true|
| 4|GERMANY|20230705| true|
| 5|GERMANY|20230714|false|
| 6|GERMANY|20230715| true|
+---+-------+--------+-----+
The DataFrame has more countries. I want to create a new column COUNT_WITH_RESET following this logic:
- If FLAG = False, then COUNT_WITH_RESET = 0.
- If FLAG = True, then COUNT_WITH_RESET should count the number of rows, starting from the previous date where FLAG = False, for that specific country.
This should be the output for the example above:
+---+-------+--------+-----+----------------+
| ID|COUNTRY| DATE| FLAG|COUNT_WITH_RESET|
+---+-------+--------+-----+----------------+
| 1|GERMANY|20230606| true| 1|
| 2|GERMANY|20230620|false| 0|
| 3|GERMANY|20230627| true| 1|
| 4|GERMANY|20230705| true| 2|
| 5|GERMANY|20230714|false| 0|
| 6|GERMANY|20230715| true| 1|
+---+-------+--------+-----+----------------+
I have tried row_number() over a window, but I can't manage to reset the count. I have also tried .rowsBetween(Window.unboundedPreceding, Window.currentRow). Here's my approach:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

window_reset = Window.partitionBy("COUNTRY").orderBy("DATE")

df_with_reset = df.withColumn(
    "COUNT_WITH_RESET",
    F.when(~F.col("FLAG"), 0).otherwise(F.row_number().over(window_reset)),
)
df_with_reset.show()
+---+-------+--------+-----+----------------+
| ID|COUNTRY| DATE| FLAG|COUNT_WITH_RESET|
+---+-------+--------+-----+----------------+
| 1|GERMANY|20230606| true| 1|
| 2|GERMANY|20230620|false| 0|
| 3|GERMANY|20230627| true| 3|
| 4|GERMANY|20230705| true| 4|
| 5|GERMANY|20230714|false| 0|
| 6|GERMANY|20230715| true| 6|
+---+-------+--------+-----+----------------+
This is obviously wrong, as my window partitions only by country, but am I on the right track? Is there a specific built-in function in PySpark to achieve this? Do I need a UDF? Any help would be appreciated.
1. Partition the dataframe by COUNTRY, then calculate the cumulative sum over the inverted FLAG column to assign group numbers; each increment marks a new block of rows starting with a false.
2. Partition the dataframe by COUNTRY along with blocks, then calculate the row number over the ordered partition to create the sequential counter, as sketched below.
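Here is a minimal sketch of those two steps (the blocks column and the window/variable names are just for illustration). One edge case the step description glosses over: the rows before a country's first false form block 0, which does not start with a false row, so its row number is already the desired count, while every later block must exclude its leading false row from the count.

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Step 1: a cumulative sum of the inverted FLAG assigns a block number that
# increases by 1 at every FLAG=False row (block 0 holds any rows before the
# first False for that country).
w_country = Window.partitionBy("COUNTRY").orderBy("DATE")
df_blocks = df.withColumn(
    "blocks", F.sum((~F.col("FLAG")).cast("int")).over(w_country)
)

# Step 2: the row number within each (COUNTRY, blocks) partition is the
# sequential counter; blocks > 0 begin with their False row, so subtract 1.
w_block = Window.partitionBy("COUNTRY", "blocks").orderBy("DATE")
df_result = df_blocks.withColumn(
    "COUNT_WITH_RESET",
    F.when(~F.col("FLAG"), 0)
     .when(F.col("blocks") == 0, F.row_number().over(w_block))
     .otherwise(F.row_number().over(w_block) - 1),
).drop("blocks")

df_result.show()

This reproduces the expected table above. An equivalent shortcut is to replace the whole when/otherwise expression with F.sum(F.col("FLAG").cast("int")).over(w_block): the running sum of true flags within a block is 0 on the False row itself and 1, 2, ... afterwards, so no special cases are needed.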