GeoToolsSpatialRDDProvider (GeoMesa Spark) extremely inefficient


I am trying to read roughly 2.5M rows from a GeoPackage file using GeoMesa's Spark integration. Since there is no custom SpatialRDDProvider implemented for this DataStore, I have to fall back to the generic GeoToolsSpatialRDDProvider. For layers with 10k-100k rows, this works perfectly. However, when I try to load layers with millions of rows, the process never finishes.
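For reference, the loading code follows the pattern from the GeoMesa Spark documentation; the GeoPackage path and layer name below are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("geopkg-load"))

// "geotools" -> "true" selects the GeoToolsSpatialRDDProvider; the other
// entries are the standard GeoTools GeoPackage DataStore parameters
val params = Map(
  "geotools" -> "true",
  "dbtype"   -> "geopkg",
  "database" -> "/data/layers.gpkg" // placeholder path
)

// fine for small layers, but never finishes for layers with millions of rows
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("my_layer"))
```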

I looked into the implementation of GeoToolsSpatialRDDProvider, and the issue seems to be that its rdd method converts the FeatureReader iterator to a List so that the result can be parallelised by the SparkContext. For large layers this conversion never terminates: Spark keeps printing warnings (mainly futures timing out) until it eventually runs out of memory.
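Schematically, the bottleneck I am describing looks like the following (a paraphrase of the pattern, not the actual GeoMesa source):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// The feature iterator is drained into an in-memory List on the driver
// before SparkContext.parallelize can ship anything to the executors.
def naiveParallelize[T: ClassTag](sc: SparkContext, features: Iterator[T]): RDD[T] = {
  val all = features.toList // O(n) driver memory; effectively never finishes for huge layers
  sc.parallelize(all)       // only reached once the entire layer is in memory
}
```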

Is there a way of making this more efficient without increasing Spark driver memory?


1 Answer

Emilio Lahr-Vivaz:

The GeoToolsSpatialRDDProvider was intended as an easy way to get data into Spark, but it is not meant for heavy workloads. I don't believe Spark has any way to parallelize data from a single thread into the cluster without loading it into memory first. As a workaround, you could load your data in batches that are small enough to complete, and then take the union of the resulting RDDs, as sketched below.
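A minimal sketch of that workaround, assuming the GeoPackage datastore honors startIndex/maxFeatures paging on the Query (the function name, row count, and batch size here are placeholders, not part of any GeoMesa API):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
import org.opengis.feature.simple.SimpleFeature

// Load one bounded batch per query, then union the results so that no
// single call has to materialize the whole layer on the driver.
def batchedRdd(
    sc: SparkContext,
    params: Map[String, String],
    typeName: String,
    totalRows: Int,  // e.g. ~2.5M, known from the layer
    batchSize: Int   // small enough for one batch to fit in driver memory
): RDD[SimpleFeature] = {
  val provider = GeoMesaSpark(params)
  val batches = (0 until totalRows by batchSize).map { offset =>
    val query = new Query(typeName)
    query.setStartIndex(offset)    // paging offset; requires datastore support
    query.setMaxFeatures(batchSize)
    provider.rdd(new Configuration(), sc, params, query)
  }
  sc.union(batches)
}
```

If the datastore does not honor startIndex/maxFeatures, the same idea works with any attribute that splits the layer into bounded chunks, e.g. a CQL filter over a fid or id range per query.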