I'm not sure why this error is coming up. The only reason I can think of is that Databricks is perhaps not running Spark 3.4.0, the release where median was added to pyspark.sql.functions. So I double-checked the Spark version and, lo and behold, it is running 3.2.1.
My issue is that I'm trying to find the median follower count for every age group. Initially I thought it would be as simple as:
from pyspark.sql import functions as f
df = spark.table('global_temp.user').join(spark.table('global_temp.pin'), 'ind')
df = df.withColumn('age_groups', f.when(f.col('age').between(18, 24), '18-24')
.when(f.col('age').between(25, 35), '25-35')
.when(f.col('age').between(35, 50), '35-50')
.when(f.col('age') > 50, '50+'))
df = df.groupBy('age_groups').agg(f.median("follower_count").alias("median_follower_count"))
Without the median function, I'm not entirely confident I can work on this question. Are there any suggestions or things I can have a look at?
I figured I may be able to do it this way:
df.groupBy('age_groups').agg(f.percentile_approx("follower_count", 0.5).alias("median_follower_count"))
Although percentile_approx may not give the same result as median, because the median is essentially the value X such that half the group is higher than X and half the group is lower than X.
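To make that worry concrete, here is a minimal plain-Python sketch (no Spark involved) of how an interpolated median can differ from a percentile that has to be one of the actual values in the group. The helper nearest_rank_percentile and the sample follower counts are made up for illustration. The helper is only a stand-in for the general "pick an element from the data" behaviour and is not Spark's actual percentile_approx algorithm.

```python
import math
import statistics


def nearest_rank_percentile(values, percentage):
    """Hypothetical helper: return the element at the nearest-rank percentile.

    This always returns a value that is present in `values`; it never
    interpolates between two neighbours.
    """
    ordered = sorted(values)
    # Nearest-rank definition: smallest rank covering `percentage` of the data.
    rank = max(1, math.ceil(percentage * len(ordered)))
    return ordered[rank - 1]


# Made-up follower counts for one age group (even number of rows).
follower_counts = [10, 20, 30, 40]

# Interpolated median: the mean of the two middle values.
exact_median = statistics.median(follower_counts)

# Element-based 50th percentile: one of the two middle values.
nearest_rank = nearest_rank_percentile(follower_counts, 0.5)

print(exact_median, nearest_rank)
```

With an odd number of rows the two approaches agree, so any gap only shows up in groups with an even row count (or, in Spark, when the approximation itself is coarse). If an exact value turns out to matter, Spark SQL also has a percentile aggregate that can be called through f.expr, which may be worth comparing against percentile_approx on the real data.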