I want to query the syslog(basically its my SQL error log) using Athena. here is my sample data.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled.
So I created a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
timestamp string,
date string,
time string,
user string,
message stringg
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
) LOCATION 's3://log/sql_error_log_stream/';
But it didn't give any results. Can someone help me to figure it out?
Few observations:
'2019-09-21T12:19:32.107Z'
is not in hiveTIMESTAMP
format, define it asSTRING
in DDL and convert like in this answer: https://stackoverflow.com/a/23520257/2700344message in the serde is represented as (\w+) group. This is wrong because message contains spaces. Try
(.*?)$
instead of(\\w+)
for message field.Try this regexp:
(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$
Use
(\\S+)
- this means everything except spaces.(\\w+)
does not work for the first group because\\w
matches any alphanumerical character and the underscore only, and first group (timestamp) contains-
and:
characters also.Also hyphen
-
if outside of character class [in square brackets] does not need shielding. and Dot . has a special meaning and needs shielding when used as dot literally: https://stackoverflow.com/a/57890202/2700344