How to handle escape characters in PySpark: replacing the escape character '\026' with NULL in a DataFrame



The sequence '\026' is randomly spread throughout all the columns, and I have to replace '\026' with NULL across all columns.

Below is my sample input data:

col1,col2,col3,col4
1,\026\026,abcd026efg,1|\026\026|abcd026efg
2,\026\026,\026\026\026,2|026\026|\026\026\026
3,ad026eg,\026\026,3|ad026eg|\026\026
4,ad026eg,xyad026,4|ad026eg|xyad026

And my output data should be:

col1,col2,col3,col4
1,NULL,abcd026efg,1||abcd026efg|
2,NULL,NULL,2|NULL|NULL|
3,ad026eg,NULL,3|ad026eg|NULL|
4,ad026eg,xyad026,4|ad026eg|xyad026|

Note: col4 is col1, col2, and col3 combined, delimited by |.
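
For anyone trying this locally, here is a minimal sketch that rebuilds the sample above as a DataFrame. It assumes the '\026' sequences are literal backslash-escaped text (not the control character), and that every column is read as a string; the variable names are mine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reproduce the sample input as an all-string DataFrame.
df = spark.createDataFrame(
    [
        ("1", "\\026\\026", "abcd026efg", "1|\\026\\026|abcd026efg"),
        ("2", "\\026\\026", "\\026\\026\\026", "2|026\\026|\\026\\026\\026"),
        ("3", "ad026eg", "\\026\\026", "3|ad026eg|\\026\\026"),
        ("4", "ad026eg", "xyad026", "4|ad026eg|xyad026"),
    ],
    ["col1", "col2", "col3", "col4"],
)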

 df.withColumn('col2', F.regexp_replace('col2', '\D\d+', None)).show()

This runs, but it replaces all of the cell values in the column with NULL.
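
For reference, a DataFrame-only sketch of the intended replacement (under the same assumption that '\026' is literal backslash-escaped text; the variable names here are illustrative): remove every '\026' with regexp_replace, then null out cells that end up empty with when/otherwise:

from pyspark.sql import functions as F

cleaned = df
for c in df.columns:
    # Remove every literal "\026" sequence from this column.
    stripped = F.regexp_replace(F.col(c), r"\\026", "")
    # A cell that contained only "\026" is now empty -> make it NULL.
    cleaned = cleaned.withColumn(
        c, F.when(stripped == "", F.lit(None)).otherwise(stripped)
    )
cleaned.show(truncate=False)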

1 Answer

Answered by Chandra Babu:

Try this if you want to do it with an RDD:

import re

# Strip every literal "\026", then map empty strings to None (NULL).
rddd = df.rdd.map(
    lambda x: [re.sub(r"\\026", "", v.strip()) for v in x]
).map(lambda x: [None if v == "" else v for v in x])

df2 = rddd.toDF(["a", "b", "c", "d"])

df2.show()
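
Note that this assumes every column is a string (e.g. the CSV was read without schema inference), since .strip() would fail on non-string values. To keep the original column names instead of a/b/c/d, passing df.columns should work just as well:

df2 = rddd.toDF(df.columns)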
