AWS Glue SerDe Classifier seems to be very greedy

I have log files that look like the following:

2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Usage Plan check succeeded for API Key *************************4Y01JJ and API Stage kbyw1pi8yl/production
2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Starting execution for request: 2a25ec26-6b85-49cd-b9c5-b5455d24f71e
2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) HTTP Method: POST, Resource Path: /
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) API Key: *************************4Y01JJ
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) API Key ID: null
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request path: {}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request query string: {}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request headers: {tracestate=3144164@nr=0-0-3144164-529450240-3bd89e21623c79ef-330955f8b18b434b-0-0.796626-1691848014574,, x-api-key=*************************4Y01JJ, traceparent=00-a9fc5401ab9b196028a76ca4bc4a87f9-3bd89e21623c79ef-00, X-Forwarded-Proto=https, X-Forwarded-For=13.55.73.250, Host=rtdemo.oztam.com.au, X-Forwarded-Port=443, newrelic=eyJ2IjpbMCwxXSwiZCI6eyJ0eSI6IkFwcCIsImFjIjoiMzE0NDE2NCIsImFwIjoiNTI5NDUwMjQwIiwidHIiOiJhOWZjNTQwMWFiOWIxOTYwMjhhNzZjYTRiYzRhODdmOSIsInByIjowLjc5NjYyNiwic2EiOmZhbHNlLCJ0aSI6MTY5MTg0ODAxNDU3NCwidHgiOiIzMzA5NTVmOGIxOGI0MzRiIiwiaWQiOiIzYmQ4OWUyMTYyM2M3OWVmIn19, X-Amzn-Trace-Id=Root=1-64d78d4e-137c6fce1951101a42bbf380, Content-Type=application/json; charset=utf-8}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Method request body before transformations: {"publisherId":"f5bb44c-5ad5-4c92-b989-7d17f48e131f","publisherName":"Seven","timestamp":"2023-08-12T13:46:54.5741539+00:00","mediaId":"FWWC23-012","demo1":"7d92f21eae0c4147a12d55cee697a4b5","deviceType":"tv","mediaType":"vod","deviceId":"e96b7203-0295-ff57-6f6d-f3b6031d525d","sessionId":"294650dd-f365-4412-dc7a-9da40ab0d74e","ipAddress":"115.64.23.121"}
2023-08-12T13:46:54.578Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) Endpoint request URI: https://lambda.ap-southeast-2.amazonaws.com/2015-03-31/functions/arn:aws:lambda:ap-southeast-2:408865984143:function:oztam-realtime-demo/invocations

I am using a GROK classifier that looks like this:

%{TIMESTAMP_ISO8601:timestamp} \(%{UUID:session_id}\) %{MSGTYPE:msg_type}%{GREEDYDATA:syslog_message}

MSGTYPE (?:(.*?):)?\s?

The objective is to capture 4 fields:

  • timestamp
  • session_id
  • msg_type
  • syslog_message (the rest of the message)

The msg_type should be any text up to the first colon (:) that follows the timestamp and session ID, both of which are captured fine. My trouble is with the MSGTYPE regex: it works in regex editors, but in the classifier it behaves greedily and keeps matching up to the last : instead of the first. I have tried many variations.

Note too that some lines contain no colon at all, and hence no msg_type, so it should be empty in those cases.
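To make the intended behaviour concrete, here is a minimal Python sketch of the split I am after. It uses Python's re module, not the Glue classifier, so it only shows the expected result and says nothing about how Glue's grok engine evaluates the pattern. The named groups, the simplified timestamp and UUID sub-patterns, and the parse_line helper are my own stand-ins for the grok patterns. The second pattern swaps the lazy .*? for a negated character class [^:]*, which cannot cross a colon regardless of how the quantifier is treated. I have not verified that this variant behaves differently in Glue.

```python
import re

# Stand-ins (assumptions) for %{TIMESTAMP_ISO8601} and %{UUID}; simplified for illustration.
PREFIX = r"^(?P<timestamp>\S+) \((?P<session_id>[0-9a-fA-F-]{36})\) "

# Same shape as the MSGTYPE custom pattern: lazy match up to the first colon.
LAZY = re.compile(PREFIX + r"(?:(?P<msg_type>.*?):)?\s?(?P<syslog_message>.*)$")

# Variant: a negated character class that can never run past a colon.
NEGATED = re.compile(PREFIX + r"(?:(?P<msg_type>[^:]*):)?\s?(?P<syslog_message>.*)$")


def parse_line(line, pattern=NEGATED):
    """Split one log line into the four fields; msg_type is '' when absent."""
    match = pattern.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["msg_type"] is None:  # optional group did not participate
        fields["msg_type"] = ""
    return fields


HEAD = "2023-08-12T13:46:54.577Z (2a25ec26-6b85-49cd-b9c5-b5455d24f71e) "
with_type = parse_line(HEAD + "HTTP Method: POST, Resource Path: /")
without_type = parse_line(HEAD + "Starting execution")
```

For the first line this gives msg_type "HTTP Method" with the rest of the line in syslog_message, even though the rest contains a second colon. For a line with no colon, msg_type comes back empty. Both patterns agree in Python, which matches what I see in regex editors.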

The pattern works on the sample data at grokconstructor.appshot.com.

Has anyone else seen this greediness in Glue classifiers, or does anyone have suggestions?

Thanks.
