AWS Glue SerDe Classifier seems to be very greedy


I have log files that look like the following:

2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Usage Plan check succeeded for API Key *************************4Y01JJ and API Stage kbyw1pi8yl/production
2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Starting execution for request: 2a25ec26-6b85-49cd-b9c5-b5455d24f71e
2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) HTTP Method: POST, Resource Path: /
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) API Key: *************************4Y01JJ
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) API Key ID: null
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request path: {}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request query string: {}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request headers: {tracestate=3144164@nr=0-0-3144164-529450240-3bd89e21623c79ef-330955f8b18b434b-0-0.796626-1691848014574,, x-api-key=*************************4Y01JJ, traceparent=00-a9fc5401ab9b196028a76ca4bc4a87f9-3bd89e21623c79ef-00, X-Forwarded-Proto=https, X-Forwarded-For=13.55.73.250, Host=rtdemo.oztam.com.au, X-Forwarded-Port=443, newrelic=eyJ2IjpbMCwxXSwiZCI6eyJ0eSI6IkFwcCIsImFjIjoiMzE0NDE2NCIsImFwIjoiNTI5NDUwMjQwIiwidHIiOiJhOWZjNTQwMWFiOWIxOTYwMjhhNzZjYTRiYzRhODdmOSIsInByIjowLjc5NjYyNiwic2EiOmZhbHNlLCJ0aSI6MTY5MTg0ODAxNDU3NCwidHgiOiIzMzA5NTVmOGIxOGI0MzRiIiwiaWQiOiIzYmQ4OWUyMTYyM2M3OWVmIn19, X-Amzn-Trace-Id=Root=1-64d78d4e-137c6fce1951101a42bbf380, Content-Type=application/json; charset=utf-8}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request body before transformations: {"publisherId":"f5bb44c-5ad5-4c92-b989-7d17f48e131f","publisherName":"Seven","timestamp":"2023-08-12T13:46:54.5741539+00:00","mediaId":"FWWC23-012","demo1":"7d92f21eae0c4147a12d55cee697a4b5","deviceType":"tv","mediaType":"vod","deviceId":"e96b7203-0295-ff57-6f6d-f3b6031d525d","sessionId":"294650dd-f365-4412-dc7a-9da40ab0d74e","ipAddress":"115.64.23.121"}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Endpoint request URI: https://lambda.ap-southeast-2.amazonaws.com/2015-03-31/functions/arn:aws:lambda:ap-southeast-2:408865984143:function:oztam-realtime-demo/invocations

I am using a GROK classifier that looks like this:

%{TIMESTAMP_ISO8601:timestamp} \(%{UUID:session_id}\) %{MSGTYPE:msg_type}%{GREEDYDATA:syslog_message}

MSGTYPE (?:(.*?):)?\s?

The objective is to capture 4 fields:

  • timestamp
  • session_id
  • msg_type
  • syslog_message (the rest of the message)

The msg_type should be any text up to the first colon (after the timestamp and session ID, which are captured correctly). My trouble lies in the MSGTYPE regex, which works fine in regex editors but always seems very greedy in the classifier: it keeps matching up to the last colon. I have tried many variations.

Note too that some lines contain no colon and therefore no msg_type, so the field should be empty in those cases.
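
For reference, the intended behaviour can be checked outside Glue with a plain regex. The sketch below is in Python and hand-expands the Grok pattern; the sub-patterns standing in for TIMESTAMP_ISO8601 and UUID are simplified assumptions rather than Glue's built-in definitions, and the helper name parse_line is made up for illustration. Python's regex engine is not the one the Glue classifier uses, so this only shows what the pattern is expected to do, not what Glue does.

```python
import re

# Hand-expanded approximation of the Grok pattern from the question.
# The timestamp and UUID parts are simplified stand-ins (assumptions),
# not the real TIMESTAMP_ISO8601 / UUID definitions.
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) "
    r"\((?P<session_id>[0-9a-fA-F-]{36})\) "
    r"(?:(?P<msg_type>.*?):)?\s?"   # MSGTYPE: lazy match up to the first colon
    r"(?P<syslog_message>.*)"       # GREEDYDATA: everything that is left
)


def parse_line(line):
    """Return the four fields as a dict, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Lines without a colon leave msg_type unmatched; report it as empty.
    fields["msg_type"] = fields["msg_type"] or ""
    return fields


if __name__ == "__main__":
    prefix = "2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) "
    for rest in ("HTTP Method: POST, Resource Path: /", "API Key ID: null"):
        f = parse_line(prefix + rest)
        print(repr(f["msg_type"]), repr(f["syslog_message"]))
    # 'HTTP Method' 'POST, Resource Path: /'
    # 'API Key ID' 'null'
```

If the classifier returns something different from this for the same line, the difference is coming from Glue rather than from the MSGTYPE expression itself.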

This works on sample data in grokconstructor.appspot.com.

Has anyone else seen this greediness in Glue classifiers, or does anyone have suggestions?

Thanks.

Answer by Andrew Waites:

After some back and forth with AWS it seems that the crawler and/or the classifier is caching the patterns somewhere. The solution, for now, is to completely delete and recreate both crawler and classifier (that worked for me). That said, it may be sufficient to delete/recreate just one of these elements.
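
The delete-and-recreate step can also be scripted. What follows is only a rough sketch, assuming a client that behaves like boto3.client("glue"); the function name and its parameters are made up for illustration, and only a few crawler settings (role, database, targets) are carried over, so anything else such as a schedule or table prefix would have to be added by hand.

```python
def recreate_crawler_and_classifier(glue, crawler_name, classifier_name,
                                    classification, grok_pattern,
                                    custom_patterns):
    """Delete and recreate a Grok classifier and the crawler that uses it.

    `glue` is assumed to behave like boto3.client("glue").
    """
    # Save the settings needed to rebuild the crawler before deleting it.
    old = glue.get_crawler(Name=crawler_name)["Crawler"]

    # Remove the crawler first so nothing still references the classifier.
    glue.delete_crawler(Name=crawler_name)
    glue.delete_classifier(Name=classifier_name)

    # Recreate the classifier with the (possibly updated) patterns.
    glue.create_classifier(
        GrokClassifier={
            "Name": classifier_name,
            "Classification": classification,
            "GrokPattern": grok_pattern,
            "CustomPatterns": custom_patterns,
        }
    )

    # Recreate the crawler with a subset of its previous settings.
    glue.create_crawler(
        Name=crawler_name,
        Role=old["Role"],
        DatabaseName=old["DatabaseName"],
        Targets=old["Targets"],
        Classifiers=[classifier_name],
    )
```

It would be called with something like glue = boto3.client("glue") and the existing crawler and classifier names; the crawler then needs to be run again so the table is rebuilt with the fresh classifier.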