How to fill the null value in dataframe to uuid?


I have a dataframe with null values in one column (not every row is null). I need to fill each null in that column with a UUID. Is there a way to do this?

scala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")), ("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]

scala> df.show
        +------+------+----------------+
        | field|field2|          values|
        +------+------+----------------+
        |stuff2|  null|            null|
        |stuff2|  null|[value1, value2]|
        |stuff3|stuff3|            null|
        +------+------+----------------+

I tried this, but every row of "field2" gets the same UUID.

scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]

scala> fillDF.show
+------+--------------------+----------------+
| field|              field2|          values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...|            null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3|              stuff3|            null|
+------+--------------------+----------------+

How can I do this? There could be more than 1,000,000 rows.


There are 3 answers

abaghel (best answer)

You can do it using a UDF and coalesce, like below. na.fill doesn't help here because its argument is evaluated once on the driver, so every row receives that single constant; a UDF is evaluated per row.

import org.apache.spark.sql.functions.{coalesce, udf}

val uuidUdf = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), uuidUdf()))
df2.show()

You will get a different UUID for each row, like below.

+------+--------------------+----------------+
| field|              field2|          values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...|            null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3|              stuff3|            null|
+------+--------------------+----------------+
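The key difference between the two attempts can be seen without Spark at all: the argument passed to na.fill is evaluated once, while a function body runs per invocation. A minimal plain-Scala sketch of that distinction (names are illustrative):

import java.util.UUID

// Evaluated once: every "row" receives the same string. This is
// exactly what happens when the result is passed to na.fill.
val fillValue = UUID.randomUUID().toString
val filledOnce = Seq(1, 2, 3).map(_ => fillValue)                // three identical UUIDs

// Deferred into a function: the body runs per element, which is
// what wrapping the call in a UDF achieves inside Spark.
val perRow = Seq(1, 2, 3).map(_ => UUID.randomUUID().toString)   // three distinct UUIDs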
Shivansh

You can easily do this using a UDF; it can be something like this:

import java.util.UUID
import org.apache.spark.sql.functions.udf

// Keep the existing value if present, otherwise generate a fresh UUID.
def generateUUID(value: String): String =
  if (value != null) value
  else UUID.randomUUID().toString

val generateUUIDUdf = udf(generateUUID _)

Now apply it to fillDF accordingly:

fillDF.withColumn("field2", generateUUIDUdf(fillDF("field2"))).show

P.S.: The code is not tested, but it should work!

Ben

This is more or less the same as the answers above, except that it avoids a UDF. Perhaps at the time there was no uuid() function available in Spark SQL? In any case, I think this is likely to be more performant and, in my opinion, easier to read.

import org.apache.spark.sql.functions.{col, coalesce, expr}

val updatedDF = df.withColumn("nullable_column", coalesce(col("nullable_column"), expr("uuid()")))
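One caveat worth noting: uuid() (like a randomUUID UDF) is non-deterministic, so if the plan is recomputed, e.g. after a failed task or when updatedDF is reused across several actions, the generated values can differ between runs. If stable IDs matter, a sketch of one way to pin them down is to persist the result first:

// Persist so the generated UUIDs are computed once and reused,
// rather than regenerated on each action over updatedDF.
val stableDF = updatedDF.cache()
stableDF.count()   // materialize the cache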