I am trying to test Spark Connect from Python. The Spark version is 3.5.0. I deployed a standalone Spark cluster on a few nodes and started the Spark Connect server. When I do:
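The session is created through the Spark Connect endpoint, roughly like this (host and port are placeholders for my setup):

    from pyspark.sql import SparkSession

    # Connect to the Spark Connect server running on the standalone cluster
    spark = SparkSession.builder \
        .remote("sc://spark-connect-host:15002") \
        .getOrCreate()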
In [133]: df = spark.range(15)
In [134]: df.is_cached
Out[134]: False
In [135]: df.cache()
Out[135]: DataFrame[id: bigint]
In [136]: df.is_cached
Out[136]: True
it works fine.
But when I use a JDBC relation, .cache() and .persist() are silently ignored:
def fetch_table(spark, table_name: str):
    return (
        spark.read
        .format("jdbc")
        .option("url", PG_DSN)
        .option("user", options["user"])
        .option("password", options["password"])
        .option("driver", "org.postgresql.Driver")
        .option("fetchsize", 1_000_000)
        .option("dbtable", table_name)
        .load()
    )
In [137]: loc = fetch_table(spark, 'loc')
In [138]: loc.is_cached
Out[138]: False
In [139]: loc.cache()
Out[139]: DataFrame[code: string, loc_type: int]
In [140]: loc.is_cached
Out[140]: False
In [141]: loc.persist()
Out[141]: DataFrame[code: string, loc_type: int]
In [142]: loc.storageLevel
Out[142]: StorageLevel(False, False, False, False, 1)
When I tested the same code without Spark Connect, caching worked fine, and cache() and persist() are listed as supported by Spark Connect.
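For reference, the comparison without Spark Connect looked roughly like this (master URL and app name are placeholders for my setup):

    from pyspark.sql import SparkSession

    # Classic session against the standalone master, no Spark Connect
    spark = SparkSession.builder \
        .master("spark://spark-master:7077") \
        .appName("jdbc-cache-test") \
        .getOrCreate()

    loc = fetch_table(spark, 'loc')  # same helper as above
    loc.cache()
    print(loc.is_cached)  # True in this non-Connect session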
Any ideas why it does not work?