I have a collection of 1,000 documents on a standalone MongoDB instance, with an average document size of 1.04 MB. I am using the mongo-spark connector v10.1.1 to read the data with an aggregation pipeline containing a $sample stage of size 120. "_id" is the default generated ObjectId.
spark.read.format("mongodb").options(configMap).load().toJSON
The config map in the above code snippet is as follows:
(partitioner, com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner)
(aggregation.pipeline, [{"$sample": {"size": 120}}])
(partitioner.options.partition.field, _id)
(connection.uri, mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1)
(partitioner.options.partition.size, 64)
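For completeness, this is roughly how I build the config map and run the read in Scala (a sketch; the variable names spark and configMap are just for illustration, and the URI is the same one as above with credentials masked):

import org.apache.spark.sql.SparkSession

// Build the SparkSession (sketch; the app name is arbitrary)
val spark = SparkSession.builder()
  .appName("mongo-sample-read")
  .getOrCreate()

// Same options as listed above, expressed as a Scala Map
val configMap: Map[String, String] = Map(
  "connection.uri" ->
    "mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1",
  "partitioner" ->
    "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner",
  "partitioner.options.partition.field" -> "_id",
  "partitioner.options.partition.size"  -> "64",   // MB per partition
  "aggregation.pipeline" -> """[{"$sample": {"size": 120}}]"""
)

// Read the collection and convert the rows to JSON strings
val df   = spark.read.format("mongodb").options(configMap).load()
val json = df.toJSON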
The result I am getting is as follows:
Please help me understand this behavior. Why does the connector create 5-6 partitions every time in this case? How is it calculating, or ending up with, 5-6 partitions? I was expecting two partitions to be created for the 120 sampled documents, since the per-partition size is set to 64 MB.
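For clarity, this is the back-of-the-envelope calculation behind my expectation of two partitions (a sketch; sizes are approximate and the variable names are just for illustration):

val sampledDocs     = 120
val avgDocSizeMB    = 1.04
val partitionSizeMB = 64.0
// ~125 MB of sampled data / 64 MB per partition ≈ 2 partitions
val expectedPartitions = math.ceil(sampledDocs * avgDocSizeMB / partitionSizeMB).toInt  // = 2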
The actual problem is that I want to read 120 random documents in total. Unfortunately, the $sample stage is applied per partition, so the result count gets multiplied by the number of partitions. Please suggest a solution for this!
I tried fetching random data using $sample through the Spark connector. I need to understand why specifically 5 or 6 partitions are getting created.
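To show what I mean, this is roughly how I check the partition count and the row count (df is the DataFrame from the load() call above; the values in the comments reflect what I observe):

// Number of input partitions the connector created
println(s"partitions: ${df.rdd.getNumPartitions}")  // prints 5 or 6 on every run, not 2
// Total rows returned after the $sample pipeline
println(s"rows: ${df.count()}")                     // roughly 120 * number of partitions, not 120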