I have a Farsi word that looks like this when displayed with UTF-8 encoding:
"خطاب"
I have two versions of this word. In Notepad++ both are displayed as above in UTF-8 mode, but if I look at them in ANSI mode, for the first one I see:
ïºïºŽï»„ﺧ
and for the other one I see:
Ø®Ø·Ø§Ø¨
How come the same words have such a different representation in ANSI format? When I use PIL in Python to draw these, the result is correct for one of these and not correct for the other.
I appreciate any help on this.
In Unicode, some characters can be represented in more than one way. In this case, the Arabic characters are represented with code points from the Arabic Presentation Forms-B Block in the first case, and with code points from the regular Arabic Block in the second case.
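As a rough illustration of the two representations, the sketch below lists the Unicode names of the code points in each version using Python's standard `unicodedata` module. The code points for the first version are an assumption, inferred from the ANSI view shown in the question; the variable and function names are made up for this example. It also shows that NFKC normalization folds the presentation forms back to the regular letters, which may be one way to clean up such text before drawing it, although the first version also appears to be stored in reversed (visual) order.

```python
import unicodedata

# Assumed code points of the first version (Arabic Presentation Forms-B),
# inferred from the ANSI view in the question.
presentation = "\ufe8f\ufe8e\ufec4\ufea7"
# Code points of the second version (regular Arabic block).
regular = "\u062e\u0637\u0627\u0628"


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("first", presentation), ("second", regular)):
    print(label)
    for code_point, name in describe(text):
        print("  ", code_point, name)

# NFKC replaces each presentation form with its regular letter.
folded = unicodedata.normalize("NFKC", presentation)

# The letters come out in reversed order, so reverse them to compare.
print(folded[::-1] == regular)  # True
```

The names printed for the first version end in ISOLATED FORM, FINAL FORM, MEDIAL FORM and INITIAL FORM, which shows that each code point already encodes a specific glyph shape rather than an abstract letter.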
If you convert the first text, as it appears in ANSI mode,
ïºïºŽï»„ﺧ
to a byte stream, you get
EFBA8F EFBA8E EFBB84 EFBAA7
Notice that you are not seeing a character representing the 8F byte in the text above, because that byte has no visible character in the ANSI code page.
That byte stream represents UTF-8-encoded text. Decoding it gives you the following Unicode code points:
fe8f fe8e fec4 fea7
You can match those in the Arabic Presentation Forms-B Block to form your Farsi text:
ﺏﺎﻄﺧ
You can do the same process for the other text:
خطاب
gives you the byte stream D8AE D8B7 D8A7 D8A8, which represents UTF-8-encoded text. Decoded, it gives you the Unicode code points 062e 0637 0627 0628, which matched to the regular Arabic Block give you the text خطاب again.
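The same steps can be reproduced in Python. This is only a sketch under two assumptions: that the bytes of the first version are the ones inferred from the ANSI view in the question, and that "ANSI" in Notepad++ means Windows-1252 on this machine. The helper names are hypothetical.

```python
# Bytes of the first version (assumed, inferred from the ANSI view)
# and of the second version (given above).
first = bytes.fromhex("EFBA8F EFBA8E EFBB84 EFBAA7")
second = bytes.fromhex("D8AE D8B7 D8A7 D8A8")


def code_points(raw):
    """Decode UTF-8 bytes and return the code points as hex strings."""
    return [f"{ord(ch):04x}" for ch in raw.decode("utf-8")]


def ansi_view(raw):
    """Decode the same bytes as Windows-1252, roughly what ANSI mode shows.

    Bytes with no Windows-1252 character (such as 8F) are replaced with
    U+FFFD here; an editor may simply show nothing for them.
    """
    return raw.decode("cp1252", errors="replace")


print(code_points(first))   # ['fe8f', 'fe8e', 'fec4', 'fea7']
print(code_points(second))  # ['062e', '0637', '0627', '0628']
print(ansi_view(first))
print(ansi_view(second))
```

Both byte streams are valid UTF-8, so an editor in UTF-8 mode renders both as Arabic text. The difference only becomes obvious when the bytes are interpreted one at a time, as ANSI mode does.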