How to drop or skip rows with data type mismatches when reading from Mongo using the Spark Mongo Connector


I am trying to read 100M+ rows from Mongo using the Spark Mongo Connector.

Does anyone know how I can ignore rows that do not match my predefined schema? There are some date fields that should be timestamps, but someone inserted String values, and there are similar issues with other fields.

I want to know how to handle these datatype mismatches; the DROPMALFORMED mode that other Spark readers support is not available in the Spark Mongo Connector.
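For reference, this is the kind of behaviour I am after; the built-in JSON/CSV readers support it via the mode option (the schema and path below are just placeholders):

    // JSON/CSV readers can silently drop rows that do not fit the schema:
    val jsonDf = spark.read
      .schema(mySchema)                       // mySchema is a placeholder StructType
      .option("mode", "DROPMALFORMED")        // skip rows that fail to parse or cast
      .json("s3://bucket/path/")              // illustrative path
    // The Spark Mongo Connector has no equivalent option.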

Any advice or pointers will be helpful.

Versions: spark-mongo-connector 10.1.1, Delta 2.3, Spark 3.3 on EMR

For example, field1 is a Date type in Mongo, but one of the records was inserted as a String (or with a wrong value); field2 is an Integer type, but again one or more records were inserted as Strings, and so on.
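Roughly, the read with my predefined schema looks like the sketch below (the field names match the example above; the connection settings are placeholders):

    import org.apache.spark.sql.types._

    // Placeholder schema matching the example fields above
    val mySchema = StructType(Seq(
      StructField("field1", DateType),
      StructField("field2", IntegerType)
    ))

    val df = spark.read
      .format("mongodb")                                // source name for connector 10.x
      .option("connection.uri", "mongodb://host:27017") // placeholder connection settings
      .option("database", "mydb")
      .option("collection", "mycoll")
      .schema(mySchema)
      .load()
    // Documents where field1/field2 were inserted as Strings make this read fail
    // with a type conversion error instead of being skipped.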


1 Answer

Answered by ghowkay:

You could try something like below:

import org.apache.spark.sql.functions._
import spark.implicits._

// "mongodb" is the source name for connector 10.x ("mongo" was the older 2.x/3.x name)
val df = spark.read.format("mongodb").load()

val correctedDf = df
  .withColumn("field1", when($"field1".cast("date").isNotNull, $"field1".cast("date"))
                          .otherwise(lit(null))) // replace null with an appropriate default for your use-case
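If the goal is to actually drop the mismatched rows rather than null them out, a rough extension of the same idea (reusing the df above, with the field names from the question, and assuming the initial load succeeds, e.g. because the conflicting fields are read as strings) could be:

    // Cast each field; a failed cast yields null, and rows where a non-null
    // value failed to cast are filtered out (DROPMALFORMED-like behaviour).
    val droppedDf = df
      .withColumn("field1_cast", $"field1".cast("date"))
      .withColumn("field2_cast", $"field2".cast("int"))
      .filter(!($"field1".isNotNull && $"field1_cast".isNull))
      .filter(!($"field2".isNotNull && $"field2_cast".isNull))
      .drop("field1", "field2")
      .withColumnRenamed("field1_cast", "field1")
      .withColumnRenamed("field2_cast", "field2")

Rows where the original value was already null are kept, since they are missing rather than malformed.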