I'm using a dataset that apparently has "double quotes" wrapped around each row. I can't see this directly because the file opens in Excel by default when I download it in my browser.
The dataset looks like this (raw):
"age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""----header 58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"--row
I use the following code:
val bank = spark.read.format("com.databricks.spark.csv").
  option("header", true).
  option("ignoreLeadingWhiteSpace", true).
  option("inferSchema", true).
  option("quote", "").
  option("delimiter", ";").
  load("bank_dataset.csv")
But what I get is data with quotes on either end and string values wrapped in doubled double quotes. What I want instead is age inferred as an int and string values wrapped in single quotes.
If you still have the raw data and want to clean it, you can use regexp_replace to strip all of the double quotes before parsing.
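For example, here is a minimal spark-shell sketch of that idea (assumptions: Spark 2.2+ so that spark.read.csv can take a Dataset[String], and the path bank_dataset.csv from your question). It removes every double quote, which is fine for this dataset since no field legitimately contains one:

import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._   // already available in spark-shell

// Read the raw file as plain text: one row per line, in a single column named "value".
val raw = spark.read.text("bank_dataset.csv")

// Strip every double quote from each line with regexp_replace.
val cleaned = raw.select(regexp_replace($"value", "\"", "").as("value"))

// Re-parse the cleaned lines as a semicolon-delimited CSV with a header.
val bank = spark.read.
  option("header", true).
  option("inferSchema", true).
  option("delimiter", ";").
  csv(cleaned.as[String])

bank.printSchema()   // age should now come back as an integer

With the stray quotes gone, inferSchema should pick up age (and the other numeric columns) as integers, and the string columns are read without the extra quoting.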