regex capture group which might not be present

I'm running a regex through some log files. The capture groups should capture some relevant fields. I'd like to know whether the log file reports that the job ended successfully. This can be determined from the presence or absence of the string "Job executed successfully".

My regex so far: ^Job started at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+orderno\s+-\s+'(\w+)'\s+runno\s+-\s+'(\d+)'[\s\S]+Host1\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)' - Host2\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)'[\s\S]+(Job executed successfully)?[\s\S]+Job ended at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+Elapsed time\s\[([\d.]+)sec\]\sCPU usage\s\[([\d.]+)sec]

(I'm kind of new to regex, so it will not be perfect at all and needs some hardening)

A sample log with a successful ending. For this one, the regex above only works when the question mark after "(Job executed successfully)" is removed, which should not be necessary in my opinion.

Job started at '0902 23:56:00:367' orderno - '0tzh0' runno - '00064' Number of transfers - 1

Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'

Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel

********** Starting transfer #1 out of 1 *************** Transfer #1 completed successfully

Job executed successfully. exiting.

Job ended at '0902 23:56:07:138' Elapsed time [7sec] CPU usage [0.15sec]

A sample log with an unsuccessful ending. Here the regex above works as it should.

Job started at '0831 15:26:00:365' orderno - '0tuq5' runno - '00030' Number of transfers - 4

Host1 'Local'[Windows-LOCAL] username 'xxx\xxx' - Host2 'xxx.xxx.xx'[Unix-SFTP] username 'xxx'

Local host is: xxx - Windows 200x [601] Service Pack 1 build 7601 - Intel64 Family 6 Model 37 Stepping 1, GenuineIntel

********** Starting transfer #1 out of 4 *************** Unable to connect to SSH server on 'xxx.xxx.xx': SFTP_Connect : psftp_connect failed : ssh_init: Network error: Connection timed out .

Connection to host sftp.onenet.be could not be established

Job ended at '0831 15:26:21:426'

Elapsed time [21sec] CPU usage [0.0sec]

There are 2 answers

Jerry On BEST ANSWER

With minimal change to your regex, you could use this one:

^Job started at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+orderno\s+-\s+'(\w+)'\s+runno\s+-\s+'(\d+)'[\s\S]+?Host1\s'([\w.]+)'\[([\w-]+)\] username '([\w\\]+)' - Host2\s'([\w.]+)'\[([\w-]+?)\] username '([\w\\]+)'[\s\S]+?(?:(Job executed successfully)[\s\S]+?)?Job ended at\s'(\d+\s\d+:\d+:\d+:\d+)'\s+Elapsed time\s\[([\d.]+)sec\]\sCPU usage\s\[([\d.]+)sec]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------^^^-----------------------------------^^

(Main changes indicated by ^ above)

I also converted some quantifiers to lazy which should make things a little faster.

regex101 demo

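Your current regex fails because the [\s\S]+ in front of the optional group is greedy: it first consumes everything up to the end of the input and then backtracks from right to left, giving back only as much as it has to. Since (Job executed successfully)? is allowed to match nothing, the engine stops backtracking as soon as the following [\s\S]+ and Job ended at can match, which happens right before Job ended at. The optional group therefore stays empty.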

With the lazy quantifiers, the engine instead advances from left to right one character at a time until it reaches the part we need, i.e. Job executed successfully if it exists, and Job ended at otherwise.
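
To see the difference in isolation, here is a small sketch in Python (the question does not say which regex engine is used, so Python's re module is an assumption). The two logs and both patterns are heavily shortened stand-ins for the ones in the question, keeping only the start timestamp, the optional success line and the end timestamp:

```python
import re

# Shortened, made-up logs modelled on the two samples in the question.
OK_LOG = (
    "Job started at '0902 23:56:00:367'\n"
    "Transfer #1 completed successfully\n"
    "Job executed successfully. exiting.\n"
    "Job ended at '0902 23:56:07:138'\n"
)
FAILED_LOG = (
    "Job started at '0831 15:26:00:365'\n"
    "Connection to host could not be established\n"
    "Job ended at '0831 15:26:21:426'\n"
)

# Shape of the original regex: greedy filler followed by a bare optional group.
GREEDY = re.compile(
    r"^Job started at\s'([\d :]+)'[\s\S]+"
    r"(Job executed successfully)?[\s\S]+"
    r"Job ended at\s'([\d :]+)'"
)

# Shape of the fixed regex: lazy filler, and the optional group is wrapped
# together with its own trailing filler in a non-capturing group.
LAZY = re.compile(
    r"^Job started at\s'([\d :]+)'[\s\S]+?"
    r"(?:(Job executed successfully)[\s\S]+?)?"
    r"Job ended at\s'([\d :]+)'"
)

print(GREEDY.search(OK_LOG).group(2))    # None: the greedy filler swallowed the success line
print(LAZY.search(OK_LOG).group(2))      # Job executed successfully
print(LAZY.search(FAILED_LOG).group(2))  # None: the job did not report success
print(LAZY.search(FAILED_LOG).group(3))  # 0831 15:26:21:426
```

In both cases the overall match succeeds; only the content of the optional group changes, so checking whether that group is empty tells you whether the job reported success.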

Jan On

If you're using PCRE, you could use the fabulous \Q...\E sequence along with a negative lookahead:

^\QJob started\E
(?:(?!\QJob ended\E).)+?
^\QJob executed successfully\E

See a demo on regex101.com (and mind the multiline, verbose and singleline modifiers!).

If not, the whole expression gets somewhat unreadable:

^Job started(?:(?!Job ended).)+?^Job executed successfully
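
As an illustration for a non-PCRE engine, the same idea can be sketched in Python, whose re module has no \Q...\E; re.escape is used in its place here. The function name job_succeeded and the two shortened logs are made up for this example, and the multiline and singleline modifiers mentioned above correspond to re.MULTILINE and re.DOTALL:

```python
import re

# re.escape stands in for \Q...\E, which Python's re module does not support.
START = re.escape("Job started")
END = re.escape("Job ended")
SUCCESS = re.escape("Job executed successfully")

# Step forward one character at a time, but never across "Job ended",
# until a line starting with the success message is reached.
PATTERN = re.compile(
    rf"^{START}(?:(?!{END}).)+?^{SUCCESS}",
    re.MULTILINE | re.DOTALL,
)


def job_succeeded(log_text: str) -> bool:
    """Return True if the success line appears between job start and job end."""
    return PATTERN.search(log_text) is not None


# Shortened, made-up logs modelled on the samples in the question.
ok_log = (
    "Job started at '0902 23:56:00:367'\n"
    "Job executed successfully. exiting.\n"
    "Job ended at '0902 23:56:07:138'\n"
)
failed_log = (
    "Job started at '0831 15:26:00:365'\n"
    "Connection to host could not be established\n"
    "Job ended at '0831 15:26:21:426'\n"
)

print(job_succeeded(ok_log))      # True
print(job_succeeded(failed_log))  # False
```

Note that this variant only answers the yes/no question of whether the job succeeded; it does not capture the other fields the original regex extracts.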