I am trying to read roughly 2.5M rows from a GeoPackage file using GeoMesa's Spark integration. Since there is no custom SpatialRDD implemented for this DataStore, I have to use the GeoToolsSpatialRDDProvider. For layers with 10k-100k rows, this works perfectly. However, when I try to load layers with millions of rows, the process never finishes.
I looked into the implementation of GeoToolsSpatialRDDProvider, and the issue seems to be that its rdd method converts the FeatureReader iterator into a List so that the SparkContext can parallelize it. For the large layers this conversion never terminates, and Spark keeps printing warnings (mainly futures timing out) until the driver eventually runs out of memory.
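For reference, I load the layer following the pattern from the GeoMesa Spark docs, roughly like this (the path and layer name are placeholders, and depending on the GeoMesa version the params map may need `.asJava` when passed to GeoMesaSpark):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("gpkg-to-rdd"))

// GeoTools DataStore parameters for a GeoPackage; "geotools" -> "true"
// routes the request to GeoToolsSpatialRDDProvider
val params = Map(
  "geotools" -> "true",
  "dbtype"   -> "geopkg",
  "database" -> "/data/layers.gpkg"  // placeholder path
)

// The provider reads the whole layer through a single FeatureReader on the
// driver and materializes it before parallelizing, which is where it stalls
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("my_layer"))
println(rdd.count())
```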
Is there a way of making this more efficient without increasing Spark driver memory?
The GeoToolsSpatialRDDProvider was intended as an easy way to get data into Spark, but it is not meant for heavy workloads. I don't believe Spark has any way to parallelize data into a Spark cluster from a single thread without loading it into memory first. As a work-around, you could load your data in batches that are small enough to complete, and then take a union of the resulting RDDs.
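As a rough sketch of that work-around, assuming the layer has a numeric attribute you can batch on (I use a placeholder column named fid here; note that a GeoPackage primary key is normally exposed as the feature ID rather than an attribute, so you may need a different column or an ID-based filter), and reusing the sc and params from the question:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.Query
import org.geotools.filter.text.ecql.ECQL
import org.locationtech.geomesa.spark.GeoMesaSpark
import org.opengis.feature.simple.SimpleFeature

// Hypothetical helper: reads the layer in fixed-size slices and unions them.
// maxFid is the upper bound of the batching attribute, batchSize controls
// how many rows the driver has to materialize per read.
def loadInBatches(sc: SparkContext,
                  params: Map[String, String],
                  typeName: String,
                  maxFid: Long,
                  batchSize: Long): RDD[SimpleFeature] = {
  val provider = GeoMesaSpark(params)
  val batches = (0L until maxFid by batchSize).map { start =>
    // each query covers a slice small enough for the provider to finish
    val filter = ECQL.toFilter(s"fid >= $start AND fid < ${start + batchSize}")
    provider.rdd(new Configuration(), sc, params, new Query(typeName, filter))
  }
  sc.union(batches)  // combine the per-batch RDDs into one
}
```

The loadInBatches helper and the fid column are assumptions about your schema; any predicate that partitions the layer into slices the provider can materialize would do. The batches are still read sequentially through a single FeatureReader, so this is about getting the load to complete rather than making it parallel.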