Filtering Log File with RegEx


Hi, I can't seem to work out how to extract the date and PID from a log file. I'm trying to display the date followed by the PID, as shown below, but my code only returns the date and not the PID.

Please see my code:

def show_time_of_pid(line):

  pattern = r"^([\w+]*[\s\d\:]+.[\[(\d+)\]])"
  result = re.search(pattern, line)

  return result

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)")) # Jul 6 14:01:23 pid:29440
<re.Match object; span=(0, 14), match='Jul 6 14:01:23'>

I was expecting Jul 6 14:01:23 pid:29440

I get <re.Match object; span=(0, 14), match='Jul 6 14:01:23'> instead, with no PID displayed.


There are 2 answers

larsks (best answer)

I would probably write things like this:

import re

def show_time_of_pid(line):

    pattern = r"^(\w{3}) \s (\d+) \s ([\d:]+) \s .[^[]+\[(\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groups()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

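Using re.VERBOSE lets us space out the parts of the pattern so it is a little easier to read: unescaped whitespace in the pattern is ignored, which is why the spaces in the log line are matched explicitly with \s. Here we have four capture groups: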

  • (\w{3}) matches the month name
  • (\d+) matches the day of the month
  • ([\d:]+) matches the time
  • .[^[]+\[(\d+)] matches the PID ("a bunch of characters that are not [, followed by [, then a string of digits, then ]"); only the digits are captured

Each group is separated by whitespace (\s).

Running the above code produces:

('Jul', '6', '14:01:23', '29440')
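
As for why the pattern in the question stops at the date: its trailing [\[(\d+)\]] is a single character class, so the parentheses and + inside it are literal characters rather than a capture group. A minimal sketch that illustrates this (the variable names are only for illustration):

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# Inside [...], "(", ")" and "+" are literals, so this class matches exactly
# one character out of: "[", "(", a digit, "+", ")", "]".
char_class = r"[\[(\d+)\]]"
print(re.fullmatch(char_class, "("))      # a match: "(" is in the class
print(re.fullmatch(char_class, "29440"))  # None: the class matches one character only

# The question's pattern therefore backtracks until "." and the class can
# consume the final "2" and "3" of the time, and the match ends there.
question_pattern = r"^([\w+]*[\s\d\:]+.[\[(\d+)\]])"
question_match = re.search(question_pattern, line)
print(question_match.group(1))  # Jul 6 14:01:23

# Moving the group outside the character class captures the PID.
pid_match = re.search(r"\[(\d+)\]", line)
print(pid_match.group(1))  # 29440
```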

You could get fancier with an outer capture group by writing:

import re

def show_time_of_pid(line):

    pattern = r"^((\w{3}) \s (\d+) \s ([\d:]+)) \s .[^[]+\[(\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groups()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

We get the entire date string in the first capture group:

('Jul 6 14:01:23', 'Jul', '6', '14:01:23', '29440')

And of course we can get back a labeled dictionary instead of just a tuple by using named capture groups:

import re

def show_time_of_pid(line):

    pattern = r"^(?P<timestamp>(?P<month>\w{3}) \s (?P<day>\d+) \s ([\d:]+)) \s .[^[]+\[(?P<pid>\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groupdict()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

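Which produces the following (the time group is left unnamed in this pattern, so it has no key of its own in the dictionary):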

{'timestamp': 'Jul 6 14:01:23', 'month': 'Jul', 'day': '6', 'pid': '29440'}
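
To get the exact string the question asked for, the named groups can be fed into a format string. The following is a sketch rather than a definitive implementation: the names LOG_PATTERN and format_time_and_pid are hypothetical, and the pattern is laid out over several lines with # comments, which re.VERBOSE also allows.

```python
import re

# re.VERBOSE ignores unescaped whitespace and "#" comments in the pattern,
# so each piece can sit on its own line.
LOG_PATTERN = re.compile(
    r"""
    ^(?P<timestamp>
        \w{3} \s      # month name, e.g. Jul
        \d+ \s        # day of the month
        [\d:]+        # time, e.g. 14:01:23
    )
    \s [^[]+          # host and process name, up to the first opening bracket
    \[(?P<pid>\d+)\]  # PID inside square brackets
    """,
    flags=re.VERBOSE,
)


def format_time_and_pid(line):
    # Hypothetical helper: returns "<timestamp> pid:<pid>" for a matching line.
    match = LOG_PATTERN.search(line)
    return "{timestamp} pid:{pid}".format(**match.groupdict())


print(format_time_and_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# Jul 6 14:01:23 pid:29440
```

This sketch assumes every line matches; a line without a bracketed PID would make search return None.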

Mark

Hey, can someone tell me if this is an acceptable workaround for my problem? It seemed to work! Thank you for your replies too, I appreciate it. It's hard to get your head around this stuff!

import re

def show_time_of_pid(line):

    pattern1 = r"^([\w]*[\s\d:]*)"
    pattern2 = r"\[(\d+)\]"
    result = re.search(pattern1, line)
    result2 = re.search(pattern2, line)

    # strip() removes the trailing space that pattern1 captures after the time
    return "{} pid:{}".format(result[1].strip(), result2[1])
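
One thing to watch with the two-pattern approach: re.search returns None when nothing matches, so indexing the result raises a TypeError on a line without a bracketed PID. A hedged sketch of a guarded variant follows; the name show_time_of_pid_safe and the choice to return None for such lines are assumptions made for illustration.

```python
import re


def show_time_of_pid_safe(line):
    # Hypothetical variant of the function above that tolerates non-matching lines.
    date_match = re.search(r"^(\w+[\s\d:]+)", line)
    pid_match = re.search(r"\[(\d+)\]", line)
    if date_match is None or pid_match is None:
        return None  # assumption: signal "no date or PID found" with None
    # strip() drops the trailing space that [\s\d:]+ picks up after the time.
    return "{} pid:{}".format(date_match[1].strip(), pid_match[1])


print(show_time_of_pid_safe("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# Jul 6 14:01:23 pid:29440
print(show_time_of_pid_safe("Jul 6 14:01:23 computer.name kernel: no pid here"))
# None
```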