regex for access log in hive serde with newline

510 views Asked by At

With aws athena services, I try to import csv file including new line data

Importing data uses hive serde format.

If data is like this, (each data is enclosed in double quotes. "")

"DataA"|"DataB"|"DataC"
"Data1"|"Data2
with new line"|"Data3"
"Data가"|"Data2나"|"Data나"

then how to write regular expressions to below table DDL?

CREATE EXTERNAL TABLE ssdm_schema.ABCTable_regex (  
  Data_A VARCHAR(100)  
, Data_B VARCHAR(100)  
, Data_C VARCHAR(100)  
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'  
WITH SERDEPROPERTIES (  
"input.regex" = '?????????'  
) 

I'm asking to this question referring to the following answer.

How to handle embed line breaks in AWS Athena

Thank you

1

There are 1 answers

0
Freiheit On

Solved it. https://regex101.com/r/bYF1Zm/3

"([\w\s\n\r\X]+?)"\|"([\X\w\s\n\r]+?)"\|"([\X\w\s\n\r]+?)" with the global and unicode flags set.

There were three things making this tricky:

  1. The line break in the middle of the data on line 2
  2. The white space in the middle of the data on line 2
  3. The unicode

This regex can probably be more succinct because the matching pattern repeats.