Iterating through a unicode string in Python


I've got an issue with iterating through Unicode strings, character by character, in Python.

print "w: ",word
for c in word:
    print "word: ",c

This is my output

w:  文本
word:  ? 
word:  ?
word:  ?
word:  ?
word:  ?
word:  ?

My desired output is:

文
本

When I call len(word) I get 6, so apparently each character is stored as 3 bytes.

So the string is stored in the variable correctly, but I cannot get the individual characters out. I have tried encode('utf-8'), decode('utf-8') and codecs, but still get no good results. This seems like a simple problem, but it is frustratingly hard for me.

Hope someone can point me in the right direction.

Thanks!


There are 4 answers

10
Pruthvi Raj On BEST ANSWER
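# -*- coding: utf-8 -*-
# Python 2: a plain "..." literal is a byte string
word = "文本"
print(word)
# Decode the UTF-8 bytes into a unicode object, then iterate over its characters
for each in unicode(word, "utf-8"):
    print(each)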

Output:

文本
文
本
1
charpi On

This is the code that worked for me:

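import codecs

# Decode the file as UTF-8 while reading, so every word is a unicode string
with codecs.open('fileName.txt', 'r', encoding='utf-8') as fileContent:
    for word in fileContent.read().split():  # split by whitespace to get words
        for c in word:
            print(c.encode('utf-8'))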
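For comparison, here is a rough Python 3 sketch of the same idea, using the built-in open() with an encoding argument instead of codecs.open. The sample file, its contents and the names path, words and chars are made up for illustration; they are not from the answer.

```python
import os
import tempfile

# Create a small UTF-8 sample file so the sketch is self-contained
# (hypothetical file name and contents)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("文本 text")

# open() decodes while reading, so every word is already a str
with open(path, encoding="utf-8") as f:
    words = f.read().split()  # split by whitespace to get words

# Iterating a str yields one character (code point) at a time
chars = [c for word in words for c in word]
for c in chars:
    print(c)
```

In Python 3 there is no need to encode each character again before printing, as long as the terminal can display it.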
0
Tsing On

You should convert the word from a byte string (str) to unicode before iterating:

print "w: ",word
for c in word.decode('utf-8'):
    print "word: ",c
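To see why decoding helps, here is a minimal Python 3 sketch (the snippet above is Python 2). It assumes the word arrived as undecoded UTF-8 bytes, which would explain len(word) being 6; raw, text and chars are hypothetical names.

```python
# Assumption: the original word was held as undecoded UTF-8 bytes
raw = "文本".encode("utf-8")
print(len(raw))   # counts bytes

# Decoding turns the bytes into a sequence of characters (code points)
text = raw.decode("utf-8")
print(len(text))  # counts characters

chars = list(text)
for c in chars:
    print("word: ", c)
```

Each of these two characters takes three bytes in UTF-8, so iterating over the undecoded data visits six pieces that cannot be displayed on their own.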
0
DevB2F On

For Python 3, this is what works:

import unicodedata

word = "文本"
word = unicodedata.normalize('NFC', word)
for char in word:
    print(char)
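A note on the normalize call: a Python 3 str is already a sequence of code points, so iterating over "文本" gives two characters with or without it. Where NFC normalization does seem to matter is text containing combining marks. The sketch below illustrates that with a made-up example string; decomposed, composed and nfc_chars are hypothetical names.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as one visible character
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # code points before normalization
print(len(composed))    # code points after NFC composition

# The word from the question has no combining marks, so NFC leaves it unchanged
nfc_chars = list(unicodedata.normalize("NFC", "文本"))
for char in nfc_chars:
    print(char)
```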