How to avoid writing the struct column name to the JSON file?


How can I avoid writing the struct column name as the root key when writing a DataFrame to a JSON file?

I am using the PySpark write method on Databricks:

df.write.option("header", "false").mode("overwrite").json(path)

I tried option("header", "false").

Sample JSON file: {"struct_col_name":{"actual_struct_data_col":"values"....}}

I need to drop the root key struct_col_name so that only the struct's own fields are written.

[Images: sample DataFrame and its printSchema() output]


There are 3 answers

0
ShaikMaheer On

You cannot do that directly from the DataFrame writer.

Consider writing the JSON first, then using Python code to read that JSON file and remove the struct column name from each record.
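A minimal sketch of that post-processing step, assuming the output is JSON Lines (one object per line, as Spark writes it) and that each record holds only the single root key. The helper names unwrap_root_key and unwrap_file are hypothetical, and since Spark writes a directory of part files, src_path would be one of those part files:

```python
import json


def unwrap_root_key(line: str, root_key: str) -> str:
    """Return the JSON line with the object under `root_key` promoted to the top level."""
    record = json.loads(line)
    inner = record.get(root_key)
    # Only unwrap when the record is exactly {root_key: {...}}; otherwise leave it unchanged.
    if set(record) != {root_key} or not isinstance(inner, dict):
        return line.strip()
    return json.dumps(inner, separators=(",", ":"))


def unwrap_file(src_path: str, dst_path: str, root_key: str) -> None:
    """Rewrite a JSON Lines file, unwrapping `root_key` in every non-empty line."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(unwrap_root_key(line, root_key) + "\n")
```

For example, unwrap_root_key('{"struct_col_name":{"actual_struct_data_col":"values"}}', "struct_col_name") returns the inner object as a JSON string without the struct_col_name key.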

0
Bhavani On

You can follow the procedure below to get the required format:

Here is a sample JSON, which is the JSON format of a data frame:

{"id":2,"name":"Alice","properties":{"age":"25","gender":"Female"}} 
{"id":1,"name":"John","properties":{"age":"30","gender":"Male"}}

Read the JSON into a Spark DataFrame and flatten its structure: use the select function to pick the nested fields, and the alias function to give each one a top-level name. Use the following code:

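from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session (Databricks already provides one as `spark`).
spark = SparkSession.builder.getOrCreate()
json_df = spark.read.json("<jsonPath>/data.json")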

flattened_df = json_df.select(
    col("id"),
    col("name"),
    col("properties.age").alias("age"),
    col("properties.gender").alias("gender")
)

To see the flattened JSON, use the code below:

# toJSON() yields one JSON string per row; collect() brings them to the driver.
json_rows = flattened_df.toJSON().collect()

# Print each JSON string
for row in json_rows:
    print(row)

Output:

{"id":2,"name":"Alice","age":"25","gender":"Female"} 
{"id":1,"name":"John","age":"30","gender":"Male"}

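Write the DataFrame out as JSON using the following code. The output path here is a placeholder that differs from the input path, because Spark cannot overwrite a path it is also reading from. The "header" option is omitted because it applies to CSV, not JSON:

flattened_df.write.mode("overwrite").json("<jsonPath>/flattened")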
0
Vikas Sharma On

Try this:

df_with_struct.select("final_struct_df.*").write.mode("overwrite").json(path)

Note: I am guessing you are just trying to write a flattened version of the dataframe. In that case, the aforementioned piece of code should work. If not, please let me know your required output by updating your question or in the comments to this answer.