I am trying to read roughly 2.5M rows from a GeoPackage file using GeoMesa's Spark integration. Since there is no custom SpatialRDD implemented for this DataStore, I have to use the GeoToolsSpatialRDDProvider. For layers with 10k-100k rows, this works perfectly. However, when I try to load layers with millions of rows, the process never finishes.
I looked into the implementation of GeoToolsSpatialRDDProvider, and the issue seems to be that its rdd method converts the FeatureReader iterator into a List so that the SparkContext can parallelize it. For the large layers this conversion never terminates, and Spark keeps printing warnings (mainly futures timing out) until the driver eventually runs out of memory.
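For reference, I load the layer following the pattern from the GeoMesa Spark docs, roughly like this (the path and layer name are placeholders, and depending on the GeoMesa version the params map may need `.asJava` when passed to GeoMesaSpark):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("gpkg-to-rdd"))

// GeoTools DataStore parameters for a GeoPackage; "geotools" -> "true"
// routes the request to GeoToolsSpatialRDDProvider
val params = Map(
  "geotools" -> "true",
  "dbtype"   -> "geopkg",
  "database" -> "/data/layers.gpkg"  // placeholder path
)

// The provider reads the whole layer through a single FeatureReader on the
// driver and materializes it before parallelizing, which is where it stalls
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("my_layer"))
println(rdd.count())
```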
Is there a way of making this more efficient without increasing Spark driver memory?
The GeoToolsSpatialRDDProvider was intended as an easy way to get data into Spark, but it is not meant for heavy workloads. I don't believe Spark has any way to parallelize data into a Spark cluster from a single thread without loading it into memory first. As a work-around, you could load your data in batches that are small enough to complete, and then take a union of the resulting RDDs.
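As a rough sketch of that work-around, assuming the layer has a numeric attribute you can batch on (I use a placeholder column named fid here; note that a GeoPackage primary key is normally exposed as the feature ID rather than an attribute, so you may need a different column or an ID-based filter), and reusing the sc and params from the question:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.Query
import org.geotools.filter.text.ecql.ECQL
import org.locationtech.geomesa.spark.GeoMesaSpark
import org.opengis.feature.simple.SimpleFeature

// Hypothetical helper: reads the layer in fixed-size slices and unions them.
// maxFid is the upper bound of the batching attribute, batchSize controls
// how many rows the driver has to materialize per read.
def loadInBatches(sc: SparkContext,
                  params: Map[String, String],
                  typeName: String,
                  maxFid: Long,
                  batchSize: Long): RDD[SimpleFeature] = {
  val provider = GeoMesaSpark(params)
  val batches = (0L until maxFid by batchSize).map { start =>
    // each query covers a slice small enough for the provider to finish
    val filter = ECQL.toFilter(s"fid >= $start AND fid < ${start + batchSize}")
    provider.rdd(new Configuration(), sc, params, new Query(typeName, filter))
  }
  sc.union(batches)  // combine the per-batch RDDs into one
}
```

The loadInBatches helper and the fid column are assumptions about your schema; any predicate that partitions the layer into slices the provider can materialize would do. The batches are still read sequentially through a single FeatureReader, so this is about getting the load to complete rather than making it parallel.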