Hive - Regex for the SYSLOG/ERRORLOG

177 views Asked by At

I want to query the syslog(basically its my SQL error log) using Athena. here is my sample data.

2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server      Buffer pool extension is already disabled. No action is necessary. 

2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server      InitializeExternalUserGroupSid failed. Implied authentication will be disabled.

So I created a table like this:

CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
  timestamp string,
  date string,
  time string,
  user string,
  message stringg
 ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
 "input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
 ) LOCATION 's3://log/sql_error_log_stream/';

But it didn't give any results. Can someone help me to figure it out?

1

There are 1 answers

2
leftjoin On BEST ANSWER

Few observations:

  1. Timestamp '2019-09-21T12:19:32.107Z' is not in hive TIMESTAMP format, define it as STRING in DDL and convert like in this answer: https://stackoverflow.com/a/23520257/2700344
  2. message in the serde is represented as (\w+) group. This is wrong because message contains spaces. Try (.*?)$ instead of (\\w+) for message field.

    Try this regexp:

    (\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$

Use (\\S+) - this means everything except spaces. (\\w+) does not work for the first group because \\w matches any alphanumerical character and the underscore only, and first group (timestamp) contains - and : characters also.

Also hyphen - if outside of character class [in square brackets] does not need shielding. and Dot . has a special meaning and needs shielding when used as dot literally: https://stackoverflow.com/a/57890202/2700344