How do I resolve this error in saving a new odt after a regex?


I've been looking for good documentation to solve this, but from what little I could find, this code should have worked. I'm curious why it doesn't, though I'm certainly not an expert.

>>> import sys
>>> import re
>>> from odf.opendocument import load
>>> from odf import text, teletype
>>> infile = load(r'C:\Users\Iainc\Documents\The Seventh Story.odt')
>>> for item in infile.getElementsByType(text.P):
...     s = teletype.extractText(item)
...     m = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', s);
...     if m != s:
...             new_item = text.P()
...             new_item.setAttribute('stylename', item.getAttribute('stylename'))
...             new_item.addText(m)
...             item.parentNode.insertBefore(new_item, item)
...             item.parentNode.removeChild(item)
... infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt')
  File "<stdin>", line 10
    infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt')
    ^^^^^^
SyntaxError: invalid syntax

This is supposed to go through a document full of nested notes (e.g., "[(blah blah [(blah [(blah (blah) blah)] )] blah )]") and remove all of them, leaving only the text before the first "[(" and after the last ")]". As far as I can tell, this code should do that, so why the error? I'm also not certain the regex itself is working as it should.

MikeM On

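The SyntaxError most likely comes from the interactive prompt rather than from the code itself: at the >>> prompt, an indented block such as the for loop has to be closed with a blank line before the next top-level statement is entered. Separately, to remove all the notes while leaving the text between each group of nested notes, re.sub will probably need to be called repeatedly in a loop.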
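A likely cause of the SyntaxError can be reproduced without odfpy. At the interactive prompt, a block must be closed with a blank line before the next top-level statement, whereas the same lines are valid in a script file. The sketch below approximates this with compile(): mode 'single' stands in for interactive input and mode 'exec' for a .py file. The typed string and the compiles helper are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for what was typed at the >>> prompt:
# a for block followed directly by a top-level statement.
typed = (
    "for n in range(3):\n"
    "    pass\n"
    "print('saved')\n"  # no blank line before this statement
)

def compiles(source, mode):
    """Return True if `source` compiles in the given mode, False on SyntaxError."""
    try:
        compile(source, "<stdin>", mode)
        return True
    except SyntaxError:
        return False

as_prompt = compiles(typed, "single")  # approximates interactive input
as_script = compiles(typed, "exec")    # approximates running a .py file
print(as_prompt, as_script)
```

If this is indeed the cause, pressing Enter on an empty line after the loop body, or saving the code to a file and running it as a script, should avoid the error.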

Your regex matches from [( to the first occurrence of )] that follows it, but not if [( appears again between them. This has the effect of matching only the innermost note of each group of nested notes, which is then replaced with the empty string to remove it.
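A small sketch of that behaviour, using a made-up sample string: a single re.sub pass strips only the innermost note, and a second pass is needed to strip the note that enclosed it.

```python
import re

# Same pattern as in the question.
NOTE = r'\[\((?:(?!\[\().)*?\)\]'

sample = "a [(b [(c)] d)] e"  # hypothetical input with one nested note

once = re.sub(NOTE, '', sample)   # removes only the inner "[(c)]"
twice = re.sub(NOTE, '', once)    # now the outer note has no "[(" inside it

print(repr(once))
print(repr(twice))
```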

To match across line endings, you need either the re.DOTALL flag, (?s) at the start of the regex, or a match-any-character class such as [\S\s] in place of the dot.
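To illustrate the three options, here is a sketch on a made-up note that spans a line break. Without any of them the dot stops at the newline and nothing is removed; the variable names are only for illustration.

```python
import re

base = r'\[\((?:(?!\[\().)*?\)\]'
sample = "x [(a\nb)] y"  # hypothetical note containing a newline

no_flag = re.sub(base, '', sample)                     # dot does not match "\n"
with_flag = re.sub(base, '', sample, flags=re.DOTALL)  # flag argument
inline = re.sub(r'(?s)' + base, '', sample)            # inline flag at the start
char_class = re.sub(r'\[\((?:(?!\[\()[\S\s])*?\)\]', '', sample)  # [\S\s] instead of .

print(repr(no_flag), repr(with_flag), repr(inline), repr(char_class))
```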

For example:

import re

text = '''
beginning [(blah blah [(blah [(blah (blah) blah)] )] blah 
blah (blah) blah blah )] middle [(blah blah [(blah [(blah
(blah) blah)] )] blah blah (blah) blah blah )] end
'''

# Each pass removes the innermost notes; repeat until a pass changes nothing.
t = ''
while t != text:
    t = text
    text = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', text, flags=re.DOTALL)

print(text)
# beginning  middle  end
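For use inside the paragraph loop from the question, the same idea could be wrapped in a helper. This is only a sketch: strip_notes is a hypothetical name, and it uses re.subn so the loop can stop as soon as a pass makes no replacements.

```python
import re

# Pattern from the question, compiled once with DOTALL so notes may span lines.
NOTE = re.compile(r'\[\((?:(?!\[\().)*?\)\]', re.DOTALL)

def strip_notes(s):
    """Remove [( ... )] notes, including nested ones, from the string s."""
    while True:
        # subn returns the new string and the number of replacements made.
        s, count = NOTE.subn('', s)
        if count == 0:
            return s

cleaned = strip_notes("before [(a [(b [(c (d) e)] )] f )] after")
print(repr(cleaned))
```

In the question's loop, m = strip_notes(s) would take the place of the single re.sub call. One caveat, assuming the loop still handles one paragraph at a time: a note that opens in one paragraph and closes in another would not be matched.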