Python regex capture multiple groups N number of times

I am parsing /proc/PID/stat for a process. The file contains a single line like this:

25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0

I came up with:

import re

def get_stats(pid):
    with open('/proc/{}/stat'.format(pid)) as fh:
        stats_raw = fh.read()
    stat_pattern = r'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?)'
    return re.findall(stat_pattern, stats_raw)

This will match the first three groups but only return one field for the last group of (-?\d+\s?):

[('25473 ', '(firefox) ', 'S ', '25468 ')]

I am looking for a way to repeat the last group a set number of times and capture every repetition:

'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?){49}'

Answer by Wiktor Stribiżew (best answer)

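You cannot access each repeated capture with Python's built-in re module: a repeated capturing group keeps only the text of its last repetition. Instead, you may capture the whole rest of the string into Group 4 and then split it on whitespace: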

import re
s = r'25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0'
stat_pattern = r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*(.*)'
res = []
for m in re.finditer(stat_pattern, s):
    res.append(m.group(1))
    res.append(m.group(2))
    res.append(m.group(3))
    res.extend(m.group(4).split())
print(res)

Output:

['25473', '(firefox)', 'S', '25468', '25465', '25465', '0', '-1', '4194304', '149151169', '108282', '32', '15', '2791321', '436115', '846', '86', '20', '0', '84', '0', '9648305', '2937786368', '209665', '18446744073709551615', '93875088982016', '93875089099888', '140722931705632', '140722931699424', '140660842079373', '0', '0', '4102', '33572009', '0', '0', '0', '17', '1', '0', '0', '175', '0', '0', '93875089107104', '93875089109128', '93875116752896', '140722931707410', '140722931707418', '140722931707418', '140722931707879', '0']
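Folded back into the question's get_stats function, the same approach might look like the sketch below. The helper name parse_stat and the module-level STAT_PATTERN are assumptions made for illustration, and the pattern still assumes the process name contains no ')' character.

```python
import re

# Same pattern as above: pid, (comm), state, then the rest of the line.
STAT_PATTERN = re.compile(r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*(.*)')

def parse_stat(line):
    """Split one /proc/PID/stat line into a flat list of string fields."""
    m = STAT_PATTERN.match(line)
    if m is None:
        raise ValueError('unexpected stat format: {!r}'.format(line))
    # Groups 1-3 are single fields; group 4 holds all remaining numbers.
    return [m.group(1), m.group(2), m.group(3)] + m.group(4).split()

def get_stats(pid):
    """Read /proc/<pid>/stat and return its fields as strings."""
    with open('/proc/{}/stat'.format(pid)) as fh:
        return parse_stat(fh.read())
```

Keeping the parsing in its own function means it can be tried on a plain string without touching /proc.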

If you need Group 4 to contain exactly 49 numbers, use

r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*((?:-?\d+\s?){49})'
                                ^^^^^^^^^^^^^^^^^^
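As a rough illustration of the difference, the sketch below runs the question's repeated group and the wrapped version on the sample line. The variable names are made up for this example.

```python
import re

s = ('25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 '
     '2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 '
     '18446744073709551615 93875088982016 93875089099888 140722931705632 '
     '140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 '
     '93875089107104 93875089109128 93875116752896 140722931707410 '
     '140722931707418 140722931707418 140722931707879 0')

# Repeating a capturing group: re keeps only the last of the 49 repetitions.
repeated = re.search(r'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?){49}', s)
last_only = repeated.group(4)
print(repr(last_only))   # '0'

# Wrapping the repetition in one capturing group keeps all 49 numbers as text.
wrapped = re.search(r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*((?:-?\d+\s?){49})', s)
numbers = wrapped.group(4).split()
print(len(numbers))      # 49
```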

With the PyPI regex module, you may give every group the same name, as in r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}'. After running regex.search(pattern, s), the .captures("o") list holds every value the group captured, including each repetition.

>>> import regex
>>> s = '25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0'
>>> stat_pattern = r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}'
>>> m = regex.search(stat_pattern, s)
>>> if m:
...     print(m.captures("o"))

Output (the numeric captures keep a trailing space because \s? sits inside the group):

['25473', '(firefox)', 'S', '25468 ', '25465 ', '25465 ', '0 ', '-1 ', '4194304 ', '149151169 ', '108282 ', '32 ', '15 ', '2791321 ', '436115 ', '846 ', '86 ', '20 ', '0 ', '84 ', '0 ', '9648305 ', '2937786368 ', '209665 ', '18446744073709551615 ', '93875088982016 ', '93875089099888 ', '140722931705632 ', '140722931699424 ', '140660842079373 ', '0 ', '0 ', '4102 ', '33572009 ', '0 ', '0 ', '0 ', '17 ', '1 ', '0 ', '0 ', '175 ', '0 ', '0 ', '93875089107104 ', '93875089109128 ', '93875116752896 ', '140722931707410 ', '140722931707418 ', '140722931707418 ', '140722931707879 ', '0']