How to not split an umlaut?


I want to use the following function to get words out of a text and put them into a list:

list = re.sub("[^\w]", " ",  'text hier einfügen').split()

Output of list is:

['text', 'hier', 'einfügen']

This works fine. But as soon as I add the ISO-8859-1 encoding declaration to my code:

# -*- coding: iso-8859-1 -*-

...it doesn't work anymore. The output becomes:

['text', 'hier', 'einf', 'gen']

How can I avoid that? I need this ISO-8859-1 encoding because otherwise the German text isn't displayed correctly in the HTML output.

Additional information (more details):

I have a form like this:

<form action="text_ch.py" method="post" name="search"><textarea cols="50" name="comment" rows="10">Text hier einfügen...</textarea>
<input type="submit" value="Analyse"><p></p>
</form>

And the python file is then:

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-

import cgi
import re

form = cgi.FieldStorage()
user_text =  form.getvalue('comment')
user_text_output = user_text
wordList = re.sub("[^\w]", " ",  user_text).split()
wordList = [x.lower() for x in wordList]

# HTML Ausgabe

print "Content-type:text/html\r\n\r\n"
print '<html>'
print '<head>'
print '<title>Title</title>'
print '<meta charset=\"utf-8\"/>'
print '</head>'
print '<body>'
print '<div style=\"width: 40%; margin: auto; border: 1px solid #333;box-shadow: 8px 8px 5px #444;padding: 8px 12px; font-family: Arial, Helvetica, sans-serif; font-size:medium; line-height:1.5;\">'

print wordList
print "</div>"
print '</body>'
print '</html>'

And the output in HTML is:

['text', 'hier', 'einf', 'gen']


There is 1 answer

Green Cloak Guy On

Instead of using [^\w] ("any non-word character"), just use [\s] ("any whitespace character").

Also, use the method re.split() instead of re.sub().split(). It's just more straightforward that way.

import re

# Raw string so that \s reaches the regex engine unchanged
lst = re.split(r"[\s]", 'text hier einfügen')
# lst is now ['text', 'hier', 'einfügen']
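To illustrate why the `[^\w]` pattern cuts the word at the umlaut, here is a minimal sketch. It assumes Python 3 (the question's script is Python 2) and assumes the form text arrives as ISO-8859-1 encoded bytes; the names `raw`, `broken` and `words` are only illustrative. On byte strings, `\w` covers ASCII letters, digits and the underscore only, so the byte for `ü` counts as a non-word character. Once the text is decoded to a Unicode string, `\w` matches `ü` as well.

```python
import re

# Assumption: the submitted text reaches the script as ISO-8859-1 bytes.
raw = "text hier einfügen".encode("iso-8859-1")

# On bytes, \w is ASCII-only, so the 0xFC byte (ü) is replaced by a space.
broken = re.sub(rb"[^\w]", b" ", raw).split()
print(broken)  # [b'text', b'hier', b'einf', b'gen']

# Decoding first gives a Unicode string, where \w also matches ü.
text = raw.decode("iso-8859-1")
words = re.sub(r"[^\w]", " ", text).split()
print(words)  # ['text', 'hier', 'einfügen']
```

In Python 2 the analogous step would probably be to decode the input to a `unicode` object and pass `re.UNICODE` as the `flags` argument, though that variant is not shown here.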

If you're worried about catching markup characters like <, >, /, etc., then it's probably easiest to list them explicitly inside the regex. Only a handful of ASCII characters have a special meaning in HTML, anyway.
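Listing the separators explicitly could look like the sketch below. The character set (`<`, `>`, `/`, `&` and both quote characters) is an assumption chosen for illustration, not a complete list, and `SEPARATORS` and `split_words` are hypothetical names. The code assumes Python 3 and text that is already a Unicode string.

```python
import re

# Assumption: whitespace plus a few HTML-relevant ASCII characters act as separators.
SEPARATORS = re.compile(r"""[\s<>/&"']+""")

def split_words(text):
    """Split text on the listed separators and drop empty fragments."""
    return [word for word in SEPARATORS.split(text) if word]

print(split_words("text <hier> einfügen/ändern"))
# ['text', 'hier', 'einfügen', 'ändern']
```

Because only the listed characters act as separators, umlauts and other non-ASCII letters stay inside their words.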