Two words with the same representation in UTF-8 have different representations in ASCII

I have a Farsi word that, when displayed as UTF-8, looks like this:

"خطاب"

I have two versions of this word. In Notepad++, with the encoding set to UTF-8, both are shown as above. But if I view them in ANSI mode, then for one of them I see:

ïºïºŽï»„ﺧ

and for the other one I see:

خطاب    

How come the same word has such different representations in ANSI format? When I use PIL in Python to draw these, the result is correct for one of them and incorrect for the other.

I appreciate any help on this.

There is 1 answer

jedivader

In Unicode, some characters can be represented in more than one way. In this case, the Arabic characters are represented with code points from the Arabic Presentation Forms-B block in the first case, and with code points from the regular Arabic block in the second case.
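To see the difference for yourself, a minimal Python sketch like the one below lists the code point and Unicode name of every character in each version. The variable names `presentation` and `regular` and the helper `describe` are made up for illustration; the code points are the ones discussed in this answer.

```python
import unicodedata

# The two versions of the word, written as explicit code points.
presentation = "\ufe8f\ufe8e\ufec4\ufea7"  # Arabic Presentation Forms-B
regular = "\u062e\u0637\u0627\u0628"       # regular Arabic block


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("presentation", presentation), ("regular", regular)):
    print(label)
    for code_point, name in describe(text):
        print("  ", code_point, name)
```

The names printed for the first string end in ISOLATED FORM, FINAL FORM, and so on, while the second string uses the plain letter names, even though both may render similarly on screen.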

If you convert the text

ïºïºŽï»„ﺧ

to a byte stream, you get

EFBA8F EFBA8E EFBB84 EFBAA7

Notice that no character appears for the 8F byte in the text above. That is because Windows-1252 (the "ANSI" code page apparently in use here) leaves 0x8F unassigned, so nothing is displayed for it.

That byte stream is UTF-8-encoded text. Decoding it gives the following Unicode code points:

FE8F FE8E FEC4 FEA7
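The decoding step can be reproduced in Python. This is only a sketch: it assumes the "ANSI" view corresponds to the `cp1252` codec, and the variable names are illustrative. Note that U+FE8F encodes to the bytes EF BA 8F in UTF-8.

```python
# Bytes of the first version of the word, as UTF-8.
raw = bytes.fromhex("EFBA8F EFBA8E EFBB84 EFBAA7")

# Correct interpretation: decode the bytes as UTF-8.
text = raw.decode("utf-8")
code_points = [f"{ord(ch):04X}" for ch in text]
print(code_points)  # ['FE8F', 'FE8E', 'FEC4', 'FEA7']

# "ANSI" interpretation (assumed to be Windows-1252): one character per byte.
# 0x8F is unassigned in cp1252, so errors="replace" substitutes U+FFFD for it.
ansi_view = raw.decode("cp1252", errors="replace")
print(ansi_view)
```

The same bytes therefore produce either four Arabic presentation-form characters or a run of Latin-looking characters, depending only on which encoding the viewer assumes.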

Looking those up in the Arabic Presentation Forms-B block gives your Farsi text:

خطاب

You can follow the same process for the other text: خطاب gives the byte stream D8AE D8B7 D8A7 D8A8, which is UTF-8-encoded text. Decoding it gives the Unicode code points 062E 0637 0627 0628, which, looked up in the regular Arabic block, again give the text خطاب.
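As a rough sketch of how the two versions relate, the snippet below encodes the regular version and then applies NFKC normalization to the presentation-form version, which maps each presentation form back to its base letter. Using `unicodedata.normalize` here is a suggestion rather than part of the analysis above, and the variable names are illustrative.

```python
import unicodedata

presentation = "\ufe8f\ufe8e\ufec4\ufea7"  # first version (Presentation Forms-B)
regular = "\u062e\u0637\u0627\u0628"       # second version (regular Arabic block)

# UTF-8 bytes of the regular version, as uppercase hex.
regular_bytes_hex = regular.encode("utf-8").hex().upper()
print(regular_bytes_hex)  # D8AED8B7D8A7D8A8

# NFKC replaces each presentation form with its base letter.
normalized = unicodedata.normalize("NFKC", presentation)
print([f"{ord(ch):04X}" for ch in normalized])  # ['0628', '0627', '0637', '062E']

# The normalized letters come out in the opposite order to the regular version.
print(normalized[::-1] == regular)  # True
```

After normalization, the first version contains the same four letters as the second but in reverse order. That suggests it was stored pre-shaped and in visual (left-to-right display) order rather than logical order, which could explain why a renderer such as PIL draws only one of the two correctly. That last point is an inference from the code points, not something confirmed in the question.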