Regex Expression to capture repeated patterns


I've been searching the internet for how to build a regular expression that captures text the way I need. I looked at several StackOverflow questions, but none of them cover what I want; if you have already seen something similar to my issue, please feel free to point me to it.

I tried to use recursion but it seems I'm not good enough to get something to work

Some notes:

1) I can't use a parser program, because the program that will consume this data uses regular expressions to capture it. It is a "general purpose" program that captures whatever data is needed; the only thing I need to do is give it a proper regular expression to get the information. I also need to keep it as compact as possible, so I can't use third-party or external programs.

2) The number of 'key': 'value' pairs can vary; it is not always the same. I believe that is what makes it difficult.

3) The program that is going to use this regex is written in Python 2.7.3. How it works: it uses a JSON config file where I set up the command to run that will give me the data I need, then I specify a regex that tells the program what needs to be captured and how to handle it, i.e. what to do with the captured groups. That is why I can't use a parser. The program uses Fabric to run the configured collector (with the regex) on remote hosts and gather all the data.

4) The program is used to gather data and post it to a web server to get metrics and other things like graphs, monitoring alarms, etc.

I have been able to capture almost all the data I was planning to capture, but when I tried to create a collector for this I got stuck.

The following data repeats exactly like below but with different server names, of course values will change too:

Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}


Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}

How I want to capture it:

Server: Omega-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 prs: 0
 score: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

Server: Alfa-X

 transfer_data: 0
 factor_a: 0
 slow: 0
 factor_b: 0
 score_retry: 0
 damage_factor_c: 0
 voice_ud: 0
 alarm_factors_bl: 0
 telemetry_x: 0
 endstream: 0
 celery: 0
 awl: 0
 prs: 0
 score: 0
 feature_factors_xf: 0
 feature_factors_dc: 0

If only a single server is shown, it is not so difficult; using the regex below I'm able to capture everything (except the name of the server):

'([a-z_]+)':\s'(\d+)'

This regex gives only the second part, the list of variables and values, but not the server name. So if the same output contains several servers with the same data, it is impossible to know which server the values come from.

To add support for the server name, I tried the following regex. It works, but it only captures the server name and the first pair of parameters:
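For reference, a minimal sketch of what the pair pattern above returns with re.findall on a single dictionary line. The sample line is shortened and the variable names are illustrative, not part of the collector program.

```python
import re

# Shortened sample of one celery.queue_length line (illustrative data).
line = "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0'}"

# The pair pattern from the question: group 1 is the key, group 2 is the value.
pair_re = re.compile(r"'([a-z_]+)':\s'(\d+)'")

# findall returns one (key, value) tuple per pair, however many pairs there are.
pairs = pair_re.findall(line)
```

Each tuple carries a key and a value, but nothing in it says which server the line belonged to, which is the limitation described above.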

Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*
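A likely reason for this behaviour, sketched on a shortened sample: the repeated group expects the pairs to sit directly next to each other, but they are separated by ", ", so the repetition stops after one pair. Even when the separator is allowed, Python's re module keeps only the last repetition of a capturing group. The second pattern (with_sep) is a hypothetical variant written for this illustration.

```python
import re

# Shortened sample with one server (illustrative data).
text = ("Server: Omega-X\n"
        "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0'}")

# The attempt from the question: "." consumes "{", then the group repeats.
attempt = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s.('([a-z_]+)':\s'(\d+)')*")
m = attempt.search(text)
first_only = (m.group(1), m.group(3), m.group(4))

# Hypothetical variant that also consumes the ", " separator between pairs.
with_sep = re.compile(
    r"Server:\s([a-zA-Z0-9-]+)\s*celery\.queue_length:\s\{"
    r"(?:'([a-z_]+)':\s'(\d+)'(?:,\s)?)*")
m2 = with_sep.search(text)
# The group matched three times, but only the final iteration is retained.
last_only = (m2.group(1), m2.group(2), m2.group(3))
```

Here first_only holds the first pair and last_only holds the last one; neither pattern exposes all pairs in a single match.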

I have tried multiple recursion features, but I've failed to achieve what I want.

Can anyone point me in the right direction here?

Thanks.


There are 3 answers

2
mquantin On BEST ANSWER

You want key-value pairs? In Python I would use a dictionary.

  1. get the server name and the string containing the data:
    Server: ([^\n]*)(?:[^{]*)\{(.*)\}

  2. build a dict with the string containing the data for each server:

In Python (the only import needed is re):

import re

# Sample collector output; "text" avoids shadowing the built-in input().
text = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}

Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}"""

results = []  # one dict per server
# Group 1 is the server name, group 2 is the text between the braces.
for server, data in re.findall(r'Server: ([^\n]*)(?:[^{]*)\{(.*)\}', text):
    # Split the "'key': 'value'" items on commas and colons, stripping spaces and quotes.
    datadict = dict((k.strip().replace("'", ""), v.strip().replace("'", ""))
                    for k, v in (item.split(':') for item in data.split(',')))
    datadict['server'] = server
    results.append(datadict)

Then you can store each datadict (e.g. in a list) and use them as you want. You can cast the values from string to integer to manipulate them easily.
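As a rough sketch of the integer cast mentioned above, the two-step idea can also be written with the question's pair pattern as the second step, keyed by server name. The sample values (3 and 7) and the names block_re, pair_re and queues are invented for illustration.

```python
import re

# Shortened sample with made-up non-zero values so the cast is visible.
text = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'slow': '3'}

Server: Alfa-X
celery.queue_length: {'transfer_data': '7', 'slow': '0'}"""

# Step 1: server name plus the raw text between the braces.
block_re = re.compile(r"Server: ([^\n]*)(?:[^{]*)\{(.*)\}")
# Step 2: every 'key': 'value' pair inside that raw text.
pair_re = re.compile(r"'([a-z_]+)':\s'(\d+)'")

queues = {}
for server, data in block_re.findall(text):
    # int(v) turns the captured digit strings into numbers.
    queues[server] = dict((k, int(v)) for k, v in pair_re.findall(data))

# With integers, arithmetic such as a per-server total becomes straightforward.
total_omega = sum(queues["Omega-X"].values())
```

Keying by server name is only one possible layout; a list of dicts as in the answer works equally well.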

0
Larry On

Thanks to those who kindly responded to my question; I think both of you helped me reshape the way I'm seeing this issue.

My belief is that what I want to achieve here is very difficult for a regex.

Given the difficulty of getting the information I want, I was thinking about which way would be easier for me to get it. I know I'm going against my own rules here, but I think there's no other way to do this smoothly.

If I want to get regex groups like:

Server: Group 0
Key : Group 1
Value: Group 2

then the output I need should look like:

Regex Groups:
        (0)      (1)          (2)         
Server: Omega-X transfer_data: 0
Server: Omega-X factor_a: 0
Server: Omega-X slow: 0
Server: Omega-X factor_b: 0
Server: Omega-X score_retry: 0
Server: Omega-X damage_factor_c: 0
Server: Omega-X voice_ud: 0
Server: Omega-X alarm_factors_bl: 0
Server: Omega-X telemetry_x: 0
Server: Omega-X endstream: 0
Server: Omega-X celery: 0
Server: Omega-X awl: 0
Server: Omega-X prs: 0
Server: Omega-X score: 0
Server: Omega-X feature_factors_xf: 0
Server: Omega-X feature_factors_dc: 0

This way I can process any number of servers in the same output without any difficulty, using a very simple regex:

"Server:\s([a-zA-Z_.-]+)\s'([a-zA-Z_]+)':\s'(\d+)'"

So I think the best way to go is to add a pre-parser that prepares the data like this, and then process it.

In fact, both of you helped me with this; much appreciated.

I guess I will close this question unless somebody else has a better idea :)
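A minimal sketch of such a pre-parser, assuming the raw output has the shape shown in the question. The function name flatten is hypothetical. Because the flattened layout in the table above has no quotes around keys and values, the line regex in this sketch omits the quotes that appear in the pattern quoted above; that adjustment is an assumption.

```python
import re

def flatten(raw):
    """Emit one 'Server: <name> <key>: <value>' line per key/value pair."""
    out = []
    # One match per server block: the name and the text between the braces.
    for server, data in re.findall(r"Server:\s*(\S+)[^{]*\{([^}]*)\}", raw):
        for key, value in re.findall(r"'([a-zA-Z_]+)':\s*'(\d+)'", data):
            out.append("Server: %s %s: %s" % (server, key, value))
    return "\n".join(out)

# Shortened sample input (illustrative data).
raw = """Server: Omega-X
celery.queue_length: {'transfer_data': '0', 'slow': '0'}

Server: Alfa-X
celery.queue_length: {'transfer_data': '0', 'slow': '0'}"""

flat = flatten(raw)

# Simple per-line regex for the collector: server, key, value.
line_re = re.compile(r"Server:\s([a-zA-Z_.-]+)\s([a-zA-Z_]+):\s(\d+)")
rows = line_re.findall(flat)
```

After flattening, every line is self-describing, so the collector regex no longer has to cope with a variable number of pairs.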

3
Hooman Bahreini On

You can use ANTLR to define your grammar, which would be a better option than a regular expression: https://dzone.com/articles/antlr-4-with-python-2-detailed-example

If you want to use regular expressions, you can use the following. Please note that my code is in C#. These patterns rely on .NET's support for variable-length lookbehind, which Python's built-in re module does not accept, so in Python the lookbehinds would need to be replaced with capture groups.

string serverNamePattern = @"(?<=Server(\s)*:(\s))\s*[\w-]+";
string dataPattern = @"(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+";
// "\n" separates the lines as in the real output; without it, [\w-]+ would run on into "celery".
string input =
    "Server: Omega-X\n" +
    "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}\n" +
    "Server: Alfa-X\n" +
    "celery.queue_length: {'transfer_data': '0', 'factor_a': '0', 'slow': '0', 'factor_b': '0', 'score_retry': '0', 'damage_factor_c': '0', 'voice_ud': '0', 'alarm_factors_bl': '0', 'telemetry_x': '0', 'endstream': '0', 'celery': '0', 'awl': '0', 'prs': '0', 'score': '0', 'feature_factors_xf': '0', 'feature_factors_dc': '0'}";

var serverNames = Regex.Matches(input, serverNamePattern);
var dataMatches = Regex.Matches(input, dataPattern);

Explanation:

+: one or more occurrence

\w: word character (alphanumeric or underscore)

\s: white space

[]: character class (a set of allowed characters)

(?<=a)b: positive lookbehind, match b that comes after a

(?<=Server(\s)*:(\s))\s*[\w-]+: match word characters and hyphens (with optional leading white space) that come after Server:

(?<=celery.queue_length[\s:]*{)[a-zA-Z0-9\s:\'_,]+: match a run of characters from [a-zA-Z0-9':_,\s] that comes after celery.queue_length: {

Note that the matched server name does not include the "Server: " prefix, so you need to add it back if you want it in the output. Also, this does not remove the single quotes from the data.
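For Python readers, a hedged sketch of the same two patterns. Python's re module rejects variable-width lookbehind, so this version uses capture groups instead; the names server_re and data_re and the shortened sample are illustrative, not part of the C# answer.

```python
import re

# Shortened sample (illustrative data).
text = (
    "Server: Omega-X\n"
    "celery.queue_length: {'transfer_data': '0', 'slow': '0'}\n"
    "Server: Alfa-X\n"
    "celery.queue_length: {'transfer_data': '0', 'slow': '0'}"
)

# The C# server-name pattern, compiled as-is, is refused by Python's re.
try:
    re.compile(r"(?<=Server(\s)*:(\s))\s*[\w-]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Capture-group equivalents: the wanted text is group 1 in each pattern.
server_re = re.compile(r"Server\s*:\s*([\w-]+)")
data_re = re.compile(r"celery\.queue_length[\s:]*\{([a-zA-Z0-9\s:'_,]+)")

server_names = server_re.findall(text)
data_matches = data_re.findall(text)
```

As with the C# version, the captured data still contains the single quotes and the server names and data strings come back as two separate lists that have to be paired by position.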