How to drop or skip rows with data type mismatches when reading from Mongo using the Spark Mongo Connector


I am trying to read 100M+ rows from Mongo using the Spark Mongo Connector.

Does anyone know how I can ignore rows that do not match my predefined schema? There are some date fields that should be timestamps, but someone inserted String values, and there are similar issues with other fields.

I want to know how to handle these datatype mismatches; the DROPMALFORMED mode that other Spark readers support is not available in the Spark Mongo Connector.
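For reference, this is the kind of behaviour I am after; the built-in JSON/CSV readers support it via the mode option (the schema and path below are just placeholders):

    // JSON/CSV readers can silently drop rows that do not fit the schema:
    val jsonDf = spark.read
      .schema(mySchema)                       // mySchema is a placeholder StructType
      .option("mode", "DROPMALFORMED")        // skip rows that fail to parse or cast
      .json("s3://bucket/path/")              // illustrative path
    // The Spark Mongo Connector has no equivalent option.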

Any advice or pointers will be helpful.

Versions: spark-mongo-connector 10.1.1, Delta 2.3, Spark 3.3 on EMR

For example, field1 is a Date type in Mongo, but one of the records was inserted as a String (or with a wrong value); field2 is an Integer type, but again one or more records were inserted as Strings, and so on.
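Roughly, the read with my predefined schema looks like the sketch below (the field names match the example above; the connection settings are placeholders):

    import org.apache.spark.sql.types._

    // Placeholder schema matching the example fields above
    val mySchema = StructType(Seq(
      StructField("field1", DateType),
      StructField("field2", IntegerType)
    ))

    val df = spark.read
      .format("mongodb")                                // source name for connector 10.x
      .option("connection.uri", "mongodb://host:27017") // placeholder connection settings
      .option("database", "mydb")
      .option("collection", "mycoll")
      .schema(mySchema)
      .load()
    // Documents where field1/field2 were inserted as Strings make this read fail
    // with a type conversion error instead of being skipped.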


1 Answer

Answered by ghowkay:

You could try something like below:

import org.apache.spark.sql.functions._
import spark.implicits._

// "mongodb" is the source name for connector 10.x ("mongo" was the older 2.x/3.x name)
val df = spark.read.format("mongodb").load()

val correctedDf = df
  .withColumn("field1", when($"field1".cast("date").isNotNull, $"field1".cast("date"))
                          .otherwise(lit(null))) // replace null with an appropriate default for your use-case
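If the goal is to actually drop the mismatched rows rather than null them out, a rough extension of the same idea (reusing the df above, with the field names from the question, and assuming the initial load succeeds, e.g. because the conflicting fields are read as strings) could be:

    // Cast each field; a failed cast yields null, and rows where a non-null
    // value failed to cast are filtered out (DROPMALFORMED-like behaviour).
    val droppedDf = df
      .withColumn("field1_cast", $"field1".cast("date"))
      .withColumn("field2_cast", $"field2".cast("int"))
      .filter(!($"field1".isNotNull && $"field1_cast".isNull))
      .filter(!($"field2".isNotNull && $"field2_cast".isNull))
      .drop("field1", "field2")
      .withColumnRenamed("field1_cast", "field1")
      .withColumnRenamed("field2_cast", "field2")

Rows where the original value was already null are kept, since they are missing rather than malformed.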