This is a follow-up to the question: pyspark dataframe withColumn command not working
I have an input dataframe, df_input (updated):
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|12 |f |&a |
|12 |f |f9 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
As you can see, inp_col and inp_val form a hierarchy, which can be any number of levels deep below a root value. Here the root (parent) values are "a" and "b".
My requirement is to replace each child value starting with "&" with the values it refers to. I tried iterating over the list of inp_val values that start with '&' and replacing each one with its list of values on every iteration, but it did not work. I'm stuck on how to build the list of values for each parent/child reference.
Code I tried:
from pyspark.sql.functions import regexp_replace, when

# values in inp_val that start with '&'
list_1 = [row['inp_val'] for row in tst.select(tst.inp_val).where(tst.inp_val.substr(0, 1) == '&').collect()]
# removing the '&' at the start of each list value
list_2 = [list_val[1:] for list_val in list_1]
tst_1 = tst.withColumn("val_extract", when(tst.inp_val.substr(0, 1) == '&', regexp_replace(tst.inp_val, "&", "")).otherwise(tst.inp_val))
for val in list_2:
    # collect the values stored under the referenced inp_col
    df_leaf = tst_1.select(tst_1.val_extract).where(tst_1.inp_col == val)
    list_3 = [row['val_extract'] for row in df_leaf.collect()]
    tst_1 = tst_1.withColumn('bool', when(tst_1.val_extract == val, 'True').otherwise('False'))
    tst_1 = tst_1.withColumn('val_extract', when(tst_1.bool == 'True', str(list_3)).otherwise(tst_1.val_extract)).drop('bool')
Updated Expected Output:
|comment|inp_col|inp_val|inp_extract |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|12 |f |&a |['a1', 'a2'] |
|12 |f |f9 |['f9'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |
After that I can use explode to get multiple rows. But the above output is what we require, and so far I have not been able to get even part of this result.
You can join the data frame to itself to get this.
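To make clear what that self-join has to achieve, here is a minimal plain-Python sketch (no Spark) of the same lookup: every value "&x" is matched against the rows whose inp_col is "x", and the match is repeated until only leaf values remain. The function name resolve_refs, the (inp_col, inp_val) tuple layout, and the cycle check are assumptions made for this illustration, and the sample rows follow the expected-output table. It is not a definitive implementation. In PySpark the equivalent step would be a DataFrame.join of the frame with itself, repeated once per hierarchy level.

```python
from collections import defaultdict

# Sample rows as (inp_col, inp_val), in the order of the expected-output table
rows = [
    ("a", "a1"), ("a", "a2"), ("f", "&a"), ("f", "f9"),
    ("b", "b3"), ("b", "b4"), ("c", "&b"), ("c", "c5"),
    ("d", "&c"), ("d", "d6"), ("e", "&d"), ("e", "e7"),
]


def resolve_refs(rows):
    """Return one list of leaf values per row, expanding every '&x' reference."""
    # Lookup side of the "self-join": inp_col -> all inp_val entries under it
    children = defaultdict(list)
    for col, val in rows:
        children[col].append(val)

    def expand(val, seen):
        # A value without the '&' prefix is already a leaf
        if not val.startswith("&"):
            return [val]
        key = val[1:]
        # Assumed safeguard: stop if a reference chain loops back on itself
        if key in seen:
            raise ValueError(f"cyclic reference through {key!r}")
        leaves = []
        for child in children.get(key, []):
            leaves.extend(expand(child, seen | {key}))
        return leaves

    return [expand(val, frozenset()) for _, val in rows]


inp_extract = resolve_refs(rows)
for (col, val), extracted in zip(rows, inp_extract):
    print(col, val, extracted)
```

This version needs all rows in driver memory, so it only suits small hierarchies. For larger data the same expansion would have to be expressed as repeated joins, one pass per level of nesting.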