I am trying to read 100M+ rows from MongoDB using the Spark Mongo Connector.
Does anyone know how I can skip rows when there is a datatype mismatch with my predefined schema? There are some date fields that should be timestamps, but someone inserted String values. There are similar issues with other fields.
I want to know how to handle the datatype mismatch; the DROPMALFORMED mode is not available in the Spark Mongo Connector.
Any advice or pointers would be appreciated.
Versions: spark-mongo-connector 10.1.1, Delta 2.3, Spark 3.3 (EMR)
For example, field1 is a Date type in Mongo, but one or more records were inserted as a String (a date string or a wrong value); field2 is an Integer type, but again one or more records were inserted as Strings, and so on.
You could try something like the following:
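A minimal PySpark sketch, assuming Spark Mongo Connector 10.x and Spark >= 3.2 (which provides the `try_cast` SQL function). The idea is to emulate DROPMALFORMED yourself: declare the problem fields as `StringType` so the connector does not fail on mixed BSON types, then cast them in Spark, where `try_cast` returns NULL instead of throwing, and drop the rows where a non-null raw value failed to cast. The database name, collection name, and field list below are placeholders for your own.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder connection details -- replace with your own URI.
spark = (
    SparkSession.builder
    .appName("mongo-read-with-validation")
    .config("spark.mongodb.read.connection.uri", "mongodb://host:27017")
    .getOrCreate()
)

# Read the ambiguous fields as strings first so the connector does not
# choke on records whose BSON type differs from the expected one.
raw_schema = StructType([
    StructField("field1", StringType(), True),  # really a date
    StructField("field2", StringType(), True),  # really an integer
])

df = (
    spark.read.format("mongodb")
    .option("database", "mydb")        # placeholder database name
    .option("collection", "mycoll")    # placeholder collection name
    .schema(raw_schema)
    .load()
)

# try_cast (Spark SQL >= 3.2) yields NULL when the value cannot be
# converted to the target type, instead of failing the job.
typed = df.select(
    F.expr("try_cast(field1 AS date)").alias("field1"),
    F.expr("try_cast(field2 AS int)").alias("field2"),
    # Keep the raw values so "was already null" can be told apart
    # from "was present but failed to cast".
    F.col("field1").alias("field1_raw"),
    F.col("field2").alias("field2_raw"),
)

# Emulate DROPMALFORMED: drop rows where the raw value was present
# but the cast came back NULL (i.e. the datatype was wrong).
clean = typed.where(
    ~((F.col("field1_raw").isNotNull() & F.col("field1").isNull()) |
      (F.col("field2_raw").isNotNull() & F.col("field2").isNull()))
).drop("field1_raw", "field2_raw")
```

Two caveats. First, verify on a sample that the connector renders your non-string BSON values as strings you can parse; BSON dates may come back in an extended-JSON form rather than plain `yyyy-MM-dd`. Second, if the date strings use a non-ISO format, swap the `try_cast` on field1 for `F.to_date(F.col("field1"), "<your format>")`, which also returns NULL on unparsable input as long as ANSI mode is off (the default).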