Get n random documents from mongo using spark connector 10.1.1


I have a collection of 1,000 documents on a standalone MongoDB instance, with an average document size of 1.04 MB. I am using the mongo-spark connector v10.1.1 to read the data, applying an aggregation pipeline with a $sample stage of size 120. "_id" is the default generated ObjectId.

spark.read.format("mongodb").options(configMap).load().toJSON

The config map in the above code snippet is as follows:

(partitioner, com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner)
(aggregation.pipeline, [{"$sample": {"size": 120}}])
(partitioner.options.partition.field, _id)
(connection.uri, mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1)
(partitioner.options.partition.size, 64)
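For reference, this is roughly how the read is wired up on my side (a minimal Scala sketch; the option keys and values are exactly the ones listed above, and the connection URI is the same placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-sample-read")
  .getOrCreate()

// Same options as listed above; the URI is kept as a placeholder.
val configMap = Map(
  "connection.uri" -> "mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1",
  "partitioner" -> "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner",
  "partitioner.options.partition.field" -> "_id",
  "partitioner.options.partition.size" -> "64",
  "aggregation.pipeline" -> """[{"$sample": {"size": 120}}]"""
)

val df = spark.read.format("mongodb").options(configMap).load()
val json = df.toJSON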

The result I am getting is as follows:

[Screenshot of the results: the read produces 5-6 partitions and more than 120 documents]

Please help me understand this behavior. Why does the connector create 5-6 partitions every time in this case? How is it calculating or ending up with 5-6 partitions? I was expecting two partitions for 120 sampled documents, since the partition size is set to 64 MB (120 documents × 1.04 MB ≈ 125 MB, i.e. roughly two 64 MB partitions).
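This is how I am checking the partition count and how many documents land in each partition (a small diagnostic sketch, reusing the spark, configMap, and df values from the snippet above):

import org.apache.spark.sql.functions.spark_partition_id

// How many partitions did the connector create for this read?
println(s"Number of partitions: ${df.rdd.getNumPartitions}")

// How many documents does each partition actually contain?
df.groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy("partition")
  .show()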

The actual problem is that I want to read 120 random documents in total. Unfortunately, the $sample stage is applied per partition, so the result count gets multiplied by the number of partitions. Please suggest a solution for this.
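The workarounds I am considering look roughly like the sketch below. This is only a sketch: it assumes the SinglePartitionPartitioner runs the pipeline exactly once because there is a single partition, and the Spark-side sampling variant assumes it is acceptable to read the full collection of 1,000 documents first.

// Workaround A: force a single partition so the $sample stage runs once
// and returns exactly 120 documents (reuses spark and configMap from above).
val singlePartitionConfig = configMap ++ Map(
  "partitioner" -> "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner"
)
val sampled = spark.read.format("mongodb").options(singlePartitionConfig).load()

// Workaround B: drop the $sample stage and sample on the Spark side instead,
// at the cost of reading the whole collection.
import org.apache.spark.sql.functions.rand
val all = spark.read.format("mongodb").options(configMap - "aggregation.pipeline").load()
val sample120 = all.orderBy(rand()).limit(120)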

I tried fetching random data using $sample through the Spark connector. I need to understand why specifically 5 or 6 partitions are being created.
