Row number that resets based on a condition


Hi, I have a dataframe with some user_id values, the months in which each user is active, and the lead_month in which they are active. I need to calculate the column "active_months" shown in the image below, which counts how many consecutive months the user has been active. When it takes more than 1 month for the user to become active again, the count resets and starts again at 1.

I can't use groupBy on my data. I need this to work as a window function, because I have other operations to perform at the user_id level.

Can anyone help me?

enter image description here

I tried a window function with Window().partitionBy(['account_id']).orderBy('reference_month').rowsBetween(Window.unboundedPreceding, Window.currentRow), but it doesn't reset the count to 1.
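To make the expected behaviour concrete, here is a minimal plain-Python sketch of the reset rule described in the question. It is only a reference for checking a window-based implementation, not a Spark solution. The function names `months_apart` and `active_months` are hypothetical, and the sketch assumes one first-of-month date per active month for a single user.

```python
from datetime import date


def months_apart(earlier, later):
    """Whole calendar months between two dates (day of month is ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def active_months(reference_months):
    """Running count of consecutive active months for one user.

    The count restarts at 1 whenever the previous active month is not
    exactly one calendar month before the current one.
    """
    counts = []
    previous = None
    for month in sorted(reference_months):
        if previous is not None and months_apart(previous, month) == 1:
            counts.append(counts[-1] + 1)  # streak continues
        else:
            counts.append(1)  # first row, or a gap: restart the streak
        previous = month
    return counts
```

For example, a user active in January, March and April 2023 would get the counts 1, 1, 2, because March follows a two-month gap.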


There is 1 answer

ARCrow On

I hope my test dataframe covers all the bases:

import pyspark.sql.functions as f
from pyspark.sql.types import DateType
from pyspark.sql.window import Window

df = spark.createDataFrame([
  (1, '2021-12-01', '2022-01-01'),
  (1, '2022-01-01', '2022-02-01'),
  (1, '2022-02-01', '2022-03-01'),
  (2, '2023-01-01', '2023-03-01'),
  (2, '2023-03-01', '2023-04-01'),
  (2, '2023-04-01', '2023-07-01'),
], ['id', 'reference_month', 'lead_month'])

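# Running window: all rows of the same id up to and including the current row.
running_window = (
  Window.partitionBy('id')
  .orderBy('reference_month')
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df = (
  df
  .select('id', f.col('reference_month').cast(DateType()), f.col('lead_month').cast(DateType()))
  # Months between this active month and the next active month.
  .withColumn('delta_lead_months', f.months_between(f.col('lead_month'), f.col('reference_month')))
  # Running count of active months per id.
  .withColumn('active_months', f.count(f.col('reference_month')).over(running_window))
)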

df.show(truncate = False)

and the output:

+---+---------------+----------+-----------------+-------------+                
|id |reference_month|lead_month|delta_lead_months|active_months|
+---+---------------+----------+-----------------+-------------+
|1  |2021-12-01     |2022-01-01|1.0              |1            |
|1  |2022-01-01     |2022-02-01|1.0              |2            |
|1  |2022-02-01     |2022-03-01|1.0              |3            |
|2  |2023-01-01     |2023-03-01|2.0              |1            |
|2  |2023-03-01     |2023-04-01|1.0              |2            |
|2  |2023-04-01     |2023-07-01|3.0              |3            |
+---+---------------+----------+-----------------+-------------+