Regex to capture C language comments using Python


This is my regular expression:

(//\s*.*)|(?s)(/\*(\s*(.*?)\s*)\*/)

I tested it at http://regex101.com/r/yJ0oA6 using the text below, and as you can see, everything works there. But when I use the same expression in Python code, I cannot capture the target strings.

Python excerpt

regex_1 = r'(//\s*.*)|(?sm)(/\*(\s*(.*?)\s*)\*/)'
pattern = re.compile(regex_1)
print re.findall(pattern,content)

Output

[('// The variable counter, which is about to be defined, is going\n// to start with a value of 0, which is zero.\nvar counter = 0;\n// Now, we are going to loop, hold on to your hat.\nwhile (counter < 100 /* counter is less than one hundred */)\n/* Every time we loop, we INCREMENT the value of counter,\n   Seriously, we just add one to it. */\n  counter++;\n// And then, we are done.\n', '', '', '')]

It should match six comments, but it returns only the single result above. Why? Am I missing something?


There are 2 answers

Oscar Mederos (best answer)

First of all, I wouldn't recommend using regular expressions for doing that, but if you know what you're doing, the following regex will work for you:

regex_1 = r'(//[^\n]+)|(/\*.+?\*/)'

I cleaned yours a little bit. It will basically match either:

  • // until the end of the line
  • /* <anything here> */ (non-greedy, of course).

The second case needs to handle multiple lines, and you can do that by specifying the re.DOTALL flag when calling re.compile:

pattern = re.compile(regex_1, re.DOTALL)

Below is the output:

('// The variable counter, which is about to be defined, is going', '')
('// to start with a value of 0, which is zero.', '')
('// Now, we are going to loop, hold on to your hat.', '')
('', '/* counter is less than one hundred */')
('', '/* Every time we loop, we INCREMENT the value of counter,\n   Seriously, we just add one to it. */')
('// And then, we are done.', '')

which is exactly what you're looking for in your example.
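For reference, here is a self-contained Python 3 sketch of the approach above. The `content` string is an assumption: it is reconstructed from the output shown in the question, since the original input file is not included in the thread.

```python
import re

# Sample input, reconstructed from the output printed in the question.
content = """\
// The variable counter, which is about to be defined, is going
// to start with a value of 0, which is zero.
var counter = 0;
// Now, we are going to loop, hold on to your hat.
while (counter < 100 /* counter is less than one hundred */)
/* Every time we loop, we INCREMENT the value of counter,
   Seriously, we just add one to it. */
  counter++;
// And then, we are done.
"""

# Group 1: // up to the end of the line ([^\n] is unaffected by DOTALL).
# Group 2: /* ... */, non-greedy; DOTALL lets "." cross line breaks.
regex_1 = r'(//[^\n]+)|(/\*.+?\*/)'
pattern = re.compile(regex_1, re.DOTALL)

# findall returns one (group1, group2) tuple per match.
matches = pattern.findall(content)
for match in matches:
    print(match)
```

Because the line-comment branch uses `[^\n]` rather than `.`, turning on `re.DOTALL` for the whole pattern does not make `//` comments run past the end of their line.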

Explosion Pills

This is due to what I'd say is a bug in Python. From http://www.regular-expressions.info/modifiers.html:

Flavors that can't apply modifiers to only part of the regex treat a modifier in the middle of the regex as an error. Python is an exception to this. In Python, putting a modifier in the middle of the regex affects the whole regex.
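A small sketch of that effect follows. It passes the flags through the `flags` argument of `re.compile` instead of writing `(?sm)` mid-pattern, because newer Python versions (3.11 and later, as far as I know) reject a global inline flag that is not at the start of the pattern. The short `content` string is made up for illustration.

```python
import re

# Made-up sample input for illustration.
content = (
    "// first comment\n"
    "var counter = 0;\n"
    "/* block\n   comment */\n"
    "// last comment\n"
)

body = r'(//\s*.*)|(/\*(\s*(.*?)\s*)\*/)'

# Flags applied to the whole pattern, which is what a mid-pattern (?sm)
# effectively did: "." now matches newlines in the // branch as well.
global_flags = re.compile(body, re.DOTALL | re.MULTILINE)
greedy = [m.group(0) for m in global_flags.finditer(content)]

# No flags at all: // comments stay on their own line, but the
# block-comment branch can no longer span several lines.
no_flags = re.compile(body)
per_line = [m.group(0) for m in no_flags.finditer(content)]

print(greedy)    # one match that swallows everything from the first //
print(per_line)  # the two // comments only
```

With the flags applied globally, the greedy `.*` after the first `//` runs to the end of the input, which is the single giant match seen in the question.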

So unfortunately you can't use (?sm). Instead, you can use [\s\S] to match newlines:

(//\s*.*)|(/\*(\s*([\s\S]*?)\s*)\*/)
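As a quick check, here is a minimal Python 3 sketch that runs this pattern with no flags at all. The `content` string is an assumption reconstructed from the output in the question, and `finditer` with `group(0)` is used so that the nested capture groups do not clutter the result.

```python
import re

# Sample input, reconstructed from the output printed in the question.
content = """\
// The variable counter, which is about to be defined, is going
// to start with a value of 0, which is zero.
var counter = 0;
// Now, we are going to loop, hold on to your hat.
while (counter < 100 /* counter is less than one hundred */)
/* Every time we loop, we INCREMENT the value of counter,
   Seriously, we just add one to it. */
  counter++;
// And then, we are done.
"""

# [\s\S] matches any character including newlines, so no DOTALL flag is
# needed and "." in the // branch still stops at the end of the line.
pattern = re.compile(r'(//\s*.*)|(/\*(\s*([\s\S]*?)\s*)\*/)')

# group(0) is the whole comment, regardless of which branch matched.
comments = [m.group(0) for m in pattern.finditer(content)]
for comment in comments:
    print(repr(comment))
```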

I should also point out that \s*.* is wrong: \s* can match a newline directly after //, so an empty // comment would let the match spill onto the next line. I think it should just be //.*. It is also important to note that this will find comment-like text inside string literals too, so you have to be careful.
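To illustrate the string-literal problem, here is a rough sketch. The `source` line is made up, and the workaround (matching double-quoted strings first and discarding them) is only one possible approach that I am assuming for illustration; it does not handle character literals, line continuations, or other corner cases of C.

```python
import re

# Made-up C line: the URL inside the string contains "//".
source = 'char *url = "http://example.com"; // real comment\n'

# Naive pattern: treats the // inside the string as a comment start.
naive = re.compile(r'//.*|/\*[\s\S]*?\*/')
naive_hits = naive.findall(source)
print(naive_hits)  # ['//example.com"; // real comment']

# Sketch of a workaround: the first branch consumes double-quoted string
# literals (with backslash escapes); only the second branch is captured.
token = re.compile(r'"(?:\\.|[^"\\\n])*"|(//.*|/\*[\s\S]*?\*/)')
comments = [m.group(1) for m in token.finditer(source) if m.group(1)]
print(comments)  # ['// real comment']
```

The idea is that the regex engine consumes the whole string literal as one match, so the `//` inside it is never considered as the start of a comment.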