With aws athena services, I try to import csv file including new line data
Importing data uses hive serde format.
If data is like this, (each data is enclosed in double quotes. "")
"DataA"|"DataB"|"DataC"
"Data1"|"Data2
with new line"|"Data3"
"Data가"|"Data2나"|"Data나"
then how to write regular expressions to below table DDL?
CREATE EXTERNAL TABLE ssdm_schema.ABCTable_regex (
Data_A VARCHAR(100)
, Data_B VARCHAR(100)
, Data_C VARCHAR(100)
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '?????????'
)
I'm asking to this question referring to the following answer.
How to handle embed line breaks in AWS Athena
Thank you
Solved it. https://regex101.com/r/bYF1Zm/3
"([\w\s\n\r\X]+?)"\|"([\X\w\s\n\r]+?)"\|"([\X\w\s\n\r]+?)"
with theg
lobal andu
nicode flags set.There were three things making this tricky:
This regex can probably be more succinct because the matching pattern repeats.