Is it possible to add our own method to the Spark Connect Thin Client API?


We would like to add a method to the Spark Connect server and expose it through the Thin Client API. Here is why:

In our organization, we have built a timeseries database using Spark and HDFS storage. It holds hundreds of thousands of signals and a few petabytes of data.

Simplified a bit, the current Python API to extract the data is:

    def getData(startTime, endTime, signalNames: list[str]) -> pyspark.sql.DataFrame:
        ...

The implementation of this method selects dozens or hundreds of HDFS files, extracts data from them, and loads it into one Spark DataFrame. The logic is non-trivial and is implemented in Java.
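To give an idea of the file-selection step only, here is a minimal sketch in Python. It is not the actual Java implementation, which is more involved. `FileEntry`, its fields, and `select_files` are hypothetical names, and the sketch assumes that each HDFS file holds exactly one signal over one closed time interval.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical index record: one HDFS file holding one signal."""
    path: str
    signal: str
    start: datetime  # first timestamp stored in the file (inclusive)
    end: datetime    # last timestamp stored in the file (inclusive)


def select_files(index, start_time, end_time, signal_names):
    """Return the paths of files that belong to a requested signal and
    whose time range overlaps the closed interval [start_time, end_time]."""
    wanted = set(signal_names)
    return [
        entry.path
        for entry in index
        # Two closed intervals overlap when each starts before the other ends.
        if entry.signal in wanted
        and entry.start <= end_time
        and entry.end >= start_time
    ]
```

Under the further assumption that the files were stored as Parquet, the selected paths could then be read into one DataFrame with something like `spark.read.parquet(*paths)`.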

Therefore, we would like to deploy this getData() implementation on the Spark Connect server and expose it on the client side as part of the Thin Client API. This would let our Python users process the extracted DataFrame further with the Thin Client DataFrame API.

So far, in the "monolithic" version of Spark, we simply invoke our Java method from Python through Py4J: spark._jvm.my.timeseries.cli.getData(...). This works for us, and we would like to do the same thing with Spark Connect.

Many thanks in advance for your time and interest, Vito
