I have been trying to parse my Python traceback logs using Logstash. My logs look like this:
[pid: 26422|app: 0|req: 73/73] 192.168.1.1 () {34 vars in 592 bytes} [Wed Feb 18 13:35:55 2015] GET /data => generated 2538923 bytes in 4078 msecs (HTTP/1.1 200) 2 headers in 85 bytes (1 switches on core 0)
Traceback (most recent call last):
  File "/var/www/analytics/parser.py", line 257, in parselogfile
    parselogline(basedir, lne)
  File "/var/www/analytics/parser.py", line 157, in parselogline
    pval = understandpost(parts[3])
  File "/var/www/analytics/parser.py", line 98, in understandpost
    val = json.loads(dct["events"])
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 355 (char 354)
So far I've been able to parse the logs, except for the last line, i.e.
ValueError: Unterminated string starting at: line 1 column 355 (char 354)
I'm using the multiline filter to do so. My Logstash configuration looks something like this:
filter {
  multiline {
    pattern => "^Traceback"
    what => "previous"
  }
  multiline {
    pattern => "^ "
    what => "previous"
  }
  grok {
    match => [
      "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
    ]
  }
  if "_grokparsefailure" in [tags] {
    multiline {
      pattern => "^.*$"
      what => "previous"
      negate => "true"
    }
  }
  if "_grokparsefailure" in [tags] {
    grok {
      match => [
        "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
      ]
      remove_tag => ["_grokparsefailure"]
    }
  }
}
But the last line still isn't parsing. Instead, it keeps giving me an error, and the extra pass has also increased the processing time drastically. Any advice on how to parse the last line of the traceback?
Well, I found a solution. The approach I followed is to treat every line that starts with '[' as the beginning of a new log message, and to append all other lines to the end of the previous message. Then a grok filter can be applied and the traceback can be parsed. Note that I have to apply two grok filters:
1. One for when there is a traceback, with a trailing GREEDYDATA to capture the traceback.
2. One for when there is no traceback. In that case the first grok fails, so I have to remove the _grokparsefailure tag and apply grok again without the GREEDYDATA. This is done with the help of an if block.
The final Logstash filter looks something like this (a sketch along those lines: the multiline pattern keys on the leading '[', and the grok pattern is the same one from the question, once with and once without the trailing %{GREEDYDATA:traceback}):
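filter {
  # A line that starts with '[' begins a new log message; every other
  # line is appended to the previous message.
  multiline {
    pattern => "^\["
    what => "previous"
    negate => true
  }

  # First grok: messages that have a traceback appended.
  grok {
    match => [
      "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
    ]
  }

  # Second grok: plain request lines without a traceback. Same pattern
  # minus the trailing GREEDYDATA; on success, the failure tag left
  # behind by the first grok is removed.
  if "_grokparsefailure" in [tags] {
    grok {
      match => [
        "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)"
      ]
      remove_tag => ["_grokparsefailure"]
    }
  }
}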
Alternatively, if you don't want to use an if block to check a second grok pattern and remove the _grokparsefailure tag, you can make the first grok filter check for both message types by putting multiple message patterns into its match array. Grok tries the patterns in order and stops at the first one that matches, so the traceback variant has to come first. It can be done like this:
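# The multiline filter stays the same as above; only the grok changes.
grok {
  match => [
    "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}",
    "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)"
  ]
}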
array of grok filter. It can be done like this:And there is a third approach as well (possibly the most elegant one). It looks something like this:
Note that in this method, the field whose existence is optional has to be enclosed in "()?". Here, that is (%{GREEDYDATA:traceback})?. The grok filter thus parses the field if it is present and skips it otherwise.