Spark Connect: could not cache/persist a JDBC relation


I am trying to test Spark Connect from Python. The Spark version is 3.5.0. I deployed a standalone Spark cluster on a few nodes and started a Spark Connect server on top of it.
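The client session is created roughly like this (a sketch; the remote URL is a placeholder for my actual server address):

from pyspark.sql import SparkSession

# Connect to the Spark Connect endpoint exposed by the standalone cluster
spark = SparkSession.builder.remote("sc://spark-master:15002").getOrCreate()

When I do: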


In [133]: df = spark.range(15)

In [134]: df.is_cached
Out[134]: False

In [135]: df.cache()
Out[135]: DataFrame[id: bigint]

In [136]: df.is_cached
Out[136]: True

it works fine.

But when I read a table over JDBC, .cache() and .persist() seem to be ignored:


def fetch_table(spark, table_name: str):
    # PG_DSN and options (user/password) are defined elsewhere in the script
    return (
        spark.read
        .format("jdbc")
        .option("url", PG_DSN)
        .option("user", options["user"])
        .option("password", options["password"])
        .option("driver", "org.postgresql.Driver")
        .option("fetchsize", 1_000_000)
        .option("dbtable", table_name)
        .load()
    )

In [137]: loc = fetch_table(spark, 'loc')

In [138]: loc.is_cached
Out[138]: False

In [139]: loc.cache()
Out[139]: DataFrame[code: string, loc_type: int]

In [140]: loc.is_cached
Out[140]: False

In [141]: loc.persist()
Out[141]: DataFrame[code: string, loc_type: int]

In [142]: loc.storageLevel
Out[142]: StorageLevel(False, False, False, False, 1)
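
For completeness, a way to check whether the data actually got cached on the executors (rather than relying on the client-side flag) is to trigger an action and inspect the physical plan, where a cached relation should show up as an InMemoryTableScan node; a sketch:

loc.persist()
loc.count()      # action to force materialization of the (intended) cache
loc.explain()    # look for InMemoryTableScan / InMemoryRelation in the physical plan
print(loc.storageLevel)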

When I tested the same code without Spark Connect, it worked fine. cache and persist are marked as supported by Spark Connect.
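
For reference, the baseline test without Spark Connect looks roughly like this (a classic local session; same fetch_table helper and placeholders as above):

from pyspark.sql import SparkSession

# Classic (non-Connect) session used for the comparison
spark_classic = SparkSession.builder.master("local[*]").getOrCreate()

loc_classic = fetch_table(spark_classic, "loc")
loc_classic.cache()
print(loc_classic.is_cached)   # True here, unlike with Spark Connect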

Any ideas why it does not work?
