Python regex on wikitext template


I'm trying to remove line breaks with Python from wikitext templates of the form:

{{cite web
|title=Testing
|url=Testing
|editor=Testing
}}

The following should be obtained with re.sub:

{{cite web|title=Testing|url=Testing|editor=Testing}}

I've been trying with Python regex for hours, yet haven't succeeded at it. For example I've tried:

while(re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}')):
     textmodif=re.sub(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', r'{cite web\1\3}}', textmodif,re.DOTALL)

But it doesn't work as expected (even without the while loop, it's not working for the first line break).

I found this similar question, but it didn't help: "Regex for MediaWiki wikitext templates". I'm quite new to Python, so please don't be too hard on me :-)

Thank you in advance.

Answer by Martijn Pieters (best answer)

You need to switch on newline matching for . by passing flags=re.DOTALL; without that flag, . does not match a newline:

re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)

You have multiple newlines spread throughout the text you want to match, so matching just one set of consecutive newlines is not enough.
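A minimal sketch of both points, using a shortened two-line template as an assumed sample input. It also illustrates a detail of the call in the question: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is read as count, and re.DOTALL (the integer 16) passed there never acts as a flag.

```python
import re

# Shortened sample input (an assumption for illustration)
text = "{{cite web\n|title=Testing\n}}"
pattern = r'\{cite web(.*?)([\r\n]+)(.*?)\}\}'
replacement = r'{cite web\1\3}}'

# re.DOTALL is an integer flag; used as a fourth positional argument
# of re.sub() it would be interpreted as count, not as flags.
flag_as_count = int(re.DOTALL)

# Same effect as the question's call: count=16, no flags set,
# so '.' cannot cross a newline and nothing is replaced.
without_flag = re.sub(pattern, replacement, text, count=flag_as_count)

# With flags=re.DOTALL the pattern matches, but a single pass
# removes only the first run of newlines.
with_flag = re.sub(pattern, replacement, text, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```

The second result still contains the newline before the closing braces, which is why either a loop or a different approach is needed.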

From the re.DOTALL documentation:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
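As a small illustration of the documented behaviour, the sketch below matches .* against a two-line string with and without the flag. The sample string is an assumption chosen only for this example.

```python
import re

sample = "first\nsecond"

# Without re.DOTALL, '.' stops at the newline
default_match = re.match(r'.*', sample).group(0)

# With re.DOTALL, '.' also matches the newline, so the whole string is consumed
dotall_match = re.match(r'.*', sample, flags=re.DOTALL).group(0)

print(repr(default_match))  # 'first'
print(repr(dotall_match))   # 'first\nsecond'
```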

You could use one re.sub() call to remove all newlines within the cite stanza in one go, without a loop:

re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)

This uses a nested regular expression to remove all whitespace with at least one newline in it from the matched text.

Demo:

>>> import re
>>> inputtext = '''\
... {{cite web
... |title=Testing
... |url=Testing
... |editor=Testing
... }}
... '''
>>> re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
<_sre.SRE_Match object at 0x10f335458>
>>> re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
'{{cite web|title=Testing|url=Testing|editor=Testing}}\n'
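
If this is needed in more than one place, the same idea could be wrapped in a helper. The sketch below is one possible variant, not the exact expression above: the name collapse_cite_web is hypothetical, the outer pattern is simplified to match from {{cite web up to the first }}, and it assumes the template contains no nested {{...}} templates.

```python
import re

# Outer pattern: one cite web template, up to the first closing braces.
# Assumption: no nested {{...}} inside the template.
CITE_WEB = re.compile(r'\{\{cite web.*?\}\}', flags=re.DOTALL)

# Inner pattern: any run of whitespace that contains at least one line break
NEWLINE_RUN = re.compile(r'\s*[\r\n]\s*')


def collapse_cite_web(text):
    """Remove line breaks inside each cite web template, leaving other text alone."""
    return CITE_WEB.sub(lambda m: NEWLINE_RUN.sub('', m.group(0)), text)


sample = "Intro line\n{{cite web\n|title=Testing\n|url=Testing\n}}\nOutro line\n"
print(collapse_cite_web(sample))
```

Because the simplified outer pattern does not require a newline, a template that is already on one line is matched and returned unchanged, and line breaks outside the template are not touched.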