How to format CSV data by removing quotes and double-quotes around fields

506 views Asked by At

I'm using a dataset and apparently it has "double quotes" wrapped around each row. I can't see it as it opens with Excel by default when I use my browser.

The dataset looks like this (raw):

"age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""----header 58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"--row

I use the following code:

val bank = spark.read.format("com.databricks.spark.csv").
 | option("header", true).
 | option("ignoreLeadingWhiteSpace", true).
 | option("inferSchema", true).
 | option("quote", "").
 | option("delimiter", ";").
 | load("bank_dataset.csv")

But what I get is: Data with quotes on either end and string values wrapped in double-double quotes What I instead want is: age as int and single quotes wrapped around string values

1

There are 1 answers

0
koiralo On

If you still have this raw data and want to clean, then you can use regex_replace to replace all double quotes"

val expr = df.columns
.map(c => regexp_replace(col(c), "\"", "").as(c.replaceAll("\"", "")))

df.select(expr: _*).show(false)

Output:

+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job       |marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|y  |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|58 |management|married|tertiary |no     |2143   |yes    |no  |unknown|5  |may  |261     |1       |-1   |0       |unknown |no |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+