Explode Array and many sub-array(s) in Pyspark

52 views Asked by At

Explode array of structs to columns in pyspark

enter image description here

Should be transformed to

enter image description here

How can I do this ?

I am unable to resolve from hits arrays to ajax array. Is there any way to explode second array from the hits array?

1

There are 1 answers

0
YuriR On

As there is no expected input and expected output and not clear what have you tried so far, i will publish my solution that has input and output, which should give you general idea how to approach that.

from pyspark.sql.functions import explode

input = {
  "id": 1,
  "cookie": "my cookie",
  "ts": 123,
  "day": 2,
  "hour": 3,
  "hits": [
    {
      "id": 4,
      "session_id": 5,
      "ts": 6,
      "ajax": [
        {
          "id": 7,
          "hit_id": 8,
          "session_id": 9,
          "ts": 124,
          "day": 10,
          "hour": 11,
          "domain_id": 12,
          "domain": "my domain",
          "path_id": 13,
          "path": "my path",
          "response_server_ms": 14,
          "response_size": 15,
          "response_code": 16
        }
      ],
      "events": [
        {
          "id": 17,
          "hit_id": 18,
          "session_id": 19,
          "event_id": 20,
          "ts": 1246,
          "event": "my event",
          "value": "my value"
        },
        {
          "id": 27,
          "hit_id": 28,
          "session_id": 29,
          "event_id": 21,
          "ts": 12467,
          "event": "my event",
          "value": "my value"
        }
      ]
    }
  ]
}

df = spark.read.json(sc.parallelize([input])).selectExpr('inline(hits)').select('events').select(explode('events').alias('events'))
df.show(truncate=False)
df.printSchema()

And the output is:

>>> df.show(truncate=False)                                                     
+-------------------------------------------+
|events                                     |
+-------------------------------------------+
|{my event, 20, 18, 17, 19, 1246, my value} |
|{my event, 21, 28, 27, 29, 12467, my value}|
+-------------------------------------------+

>>> df.printSchema()
root
 |-- events: struct (nullable = true)
 |    |-- event: string (nullable = true)
 |    |-- event_id: long (nullable = true)
 |    |-- hit_id: long (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- session_id: long (nullable = true)
 |    |-- ts: long (nullable = true)
 |    |-- value: string (nullable = true)