"Module 'pyspark.sql.functions' has no attribute 'median'" error while using PySpark on Databricks


I'm not sure why this error is coming up. The only reason I can think of is that Databricks is perhaps not running Spark 3.4.0. I double-checked the Spark version and, lo and behold, it is running 3.2.1.

My issue is that I'm trying to find the median follower count for every age group. Initially I thought it would be as simple as:

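from pyspark.sql import functions as f

# Join users to pins on the shared 'ind' key
df = spark.table('global_temp.user').join(spark.table('global_temp.pin'), 'ind')
# Bucket ages; the first matching branch wins, so age 35 lands in '25-35'
df = df.withColumn('age_groups', f.when(f.col('age').between(18, 24), '18-24')
                                  .when(f.col('age').between(25, 35), '25-35')
                                  .when(f.col('age').between(35, 50), '35-50')
                                  .when(f.col('age') > 50, '50+'))
# f.median was added in Spark 3.4.0, so this line raises the AttributeError on 3.2.1
df = df.groupBy('age_groups').agg(f.median("follower_count").alias("median_follower_count"))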

Without the median function, I'm not confident I can solve this. Are there any suggestions or things I can look at?

I figured I may be able to do it this way:

df.groupBy('age_groups').agg(f.percentile_approx("follower_count", 0.5).alias("median_follower_count"))

However, percentile_approx may not give the same result as median, since the median is the value X such that half the group is above X and half the group is below X.
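To make that concern concrete, here is a minimal plain-Python sketch (no Spark needed) of how an exact median can differ from a percentile that has to return a value actually present in the group. The follower counts are made up for illustration, and `median_low` is only a stand-in for that "pick an existing value" behaviour, not the actual percentile_approx algorithm.

```python
from statistics import median, median_low

# Hypothetical follower counts for one age group (made-up values).
even_group = [10, 20, 30, 1000]
odd_group = [10, 20, 30]

# Exact median: with an even number of rows, the two middle values are averaged.
exact_even = median(even_group)

# Stand-in for a percentile that must return a value present in the data:
# median_low picks the lower of the two middle values instead of averaging.
element_even = median_low(even_group)

# With an odd number of rows there is a single middle value, so both agree.
exact_odd = median(odd_group)
element_odd = median_low(odd_group)

print(exact_even, element_even)
print(exact_odd, element_odd)
```

So the two can disagree on groups with an even number of rows, and agree when there is a single middle value. If an exact median is needed on Spark 3.2.1, one option that may be worth checking is Spark SQL's `percentile` aggregate, called as f.expr('percentile(follower_count, 0.5)') inside agg; I have not confirmed it on this cluster.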
