I have the following data in a PySpark dataframe, where the val column contains JSON data stored as a string.
```python
data = [(123, '[{"FLD_NAME":"A","FLD_VAL":"0.1"},{"FLD_NAME":"B","FLD_VAL":"0.2"},{"FLD_NAME":"C","FLD_VAL":"0.3"},{"FLD_NAME":"D","FLD_VAL":"0.4"}]')]
ar = spark.createDataFrame(data, ['id', 'val'])
```
| id | val |
|---|---|
| 123 | [{"FLD_NAME":"A","FLD_VAL":"0.1"},{"FLD_NAME":"B","FLD_VAL":"0.2"},{"FLD_NAME":"C","FLD_VAL":"0.3"},{"FLD_NAME":"D","FLD_VAL":"0.4"}] |
Now, my aim is to transform the string data in the val column into a dictionary (a map column). For example:
```json
{
  "A": 0.1,
  "B": 0.2,
  "C": 0.3,
  "D": 0.4
}
```
So the resulting data would look like the following:
| id | val |
|---|---|
| 123 | {"A": 0.1, "B": 0.2, "C": 0.3,"D": 0.4} |
Note: I also need to convert the FLD_VAL values from string to decimal.
I have tried the following code:
```python
def func(rows):
    lp = {row['FLD_NAME']: row['FLD_VAL'] for row in rows}
    return lp

arr = ar\
    .rdd\
    .map(lambda row: (row[0], func(row[1])))\
    .groupByKey()\
    .toDF(["id", "val"])
```
This code throws the following error:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 20.0 failed 4 times, most recent failure: Lost task 15.3 in stage 20.0 (TID 186) (10.5.152.101 executor 0): org.apache.spark.api.python.PythonException: 'TypeError: string indices must be integers', from <
```
Here's one way to do it: read the string as JSON, explode the array, extract the individual fields, and build a map from the collected pairs. (Your attempt fails because `row[1]` is still a plain string at that point, so iterating over it yields single characters, and indexing a character with `'FLD_NAME'` raises `TypeError: string indices must be integers`.)
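A sketch of that approach, assuming Spark 2.4 or later (`map_from_entries` was added in 2.4) and an assumed `decimal(10,2)` precision for the values:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Schema of the JSON string: an array of {FLD_NAME, FLD_VAL} structs
schema = ArrayType(StructType([
    StructField("FLD_NAME", StringType()),
    StructField("FLD_VAL", StringType()),
]))

result = (
    ar
    # parse the JSON string into an array of structs
    .withColumn("parsed", F.from_json("val", schema))
    # explode so each {FLD_NAME, FLD_VAL} struct becomes its own row
    .select("id", F.explode("parsed").alias("elem"))
    # pull out the name and cast the value to decimal
    .select(
        "id",
        F.col("elem.FLD_NAME").alias("key"),
        F.col("elem.FLD_VAL").cast("decimal(10,2)").alias("value"),
    )
    # collect the (key, value) pairs per id and rebuild them as a map column
    .groupBy("id")
    .agg(F.map_from_entries(F.collect_list(F.struct("key", "value"))).alias("val"))
)
result.show(truncate=False)
```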
I am sure this can be done more concisely using transform, but the explode version above is more self-explanatory.
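For completeness, a sketch of the `transform` variant (same Spark 2.4+ and `decimal(10,2)` assumptions, reusing the imports above). It builds the key/value structs inside a SQL expression, which avoids the explode/groupBy round trip and the associated shuffle:

```python
result = ar.withColumn(
    "val",
    F.map_from_entries(
        F.expr(
            "transform("
            "  from_json(val, 'array<struct<FLD_NAME:string,FLD_VAL:string>>'),"
            "  x -> struct(x.FLD_NAME, cast(x.FLD_VAL as decimal(10,2)))"
            ")"
        )
    ),
)
```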
Output:
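With the sample row above, `result.show(truncate=False)` should print something like the following (exact rendering depends on the Spark version):

```
+---+--------------------------------------------+
|id |val                                         |
+---+--------------------------------------------+
|123|{A -> 0.10, B -> 0.20, C -> 0.30, D -> 0.40}|
+---+--------------------------------------------+
```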