Processing Sinhala letters with \u200d zero-width joiner in Python 3.7.13


The following issue occurs when a function returns a Sinhala sentence that contains the \u200d zero-width joiner.

def test_function():
  return 'ෆ්‍රෙන්ස්!' #frens

test_function()

The output is shown as 'ෆ්\u200dරෙන්ස්!' (= f\u200drens!) rather than 'ෆ්‍රෙන්ස්!' (= frens!). The issue does not occur when printing to the terminal or writing to a file, but it does appear when the value is assigned to a variable.

I tried encode().decode() and unicodedata.normalize, neither of which resolves the issue. The closest lead I found was the question "Printing family emoji, with U+200D zero-width joiner, directly, vs via list", where I realised that the issue is possibly due to the \u200d joiner.

Note: The code was tested on Colab.

Thank you so much in advance.

Answer by furas:

This is not a bug, and it has nothing to do with variables; it is only how some tools display values.
They use repr() to produce output that is more useful for debugging.

If you use

print( test_function() )

result = test_function()
print(result)

then you should see the expected ෆ්‍රෙන්ස්! (print() does not add the quotes).

And if you compare the values

test_function() == 'ෆ්‍රෙන්ස්!'

then you should get True

But if you use interactive mode in Python (or IPython, Jupyter Notebook, Google Colab, etc.), which automatically displays the value of an expression, then it uses repr() to display it, which is effectively print(repr(...)), and you see 'ෆ්\u200dරෙන්ස්!'

You get the same result by calling repr() manually:

print(repr('ෆ්‍රෙන්ස්!'))

print(repr( test_function() ))
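
To make the difference concrete, here is a minimal sketch using only built-ins. The variable names are my own, and the word is written with an explicit \u200d escape so that the joiner is visible in the source; inside a Python string literal that escape is the real joiner character, not six separate characters.

```python
# The word from the question, with the zero-width joiner written as an escape.
text = "ෆ්\u200dරෙන්ස්!"

shown_by_print = str(text)   # what print() writes
shown_by_repl = repr(text)   # what an interactive prompt or notebook cell echoes

print(shown_by_print)
print(shown_by_repl)

# The stored value is the same either way; only the display differs.
print("\u200d" in text)             # the real joiner is in the string
print("\\u200d" in text)            # the six-character escape text is not
print("\\u200d" in shown_by_repl)   # it appears only in the repr() output
```

The last three lines show that the backslash sequence exists only in the text produced by repr(), never in the string itself.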

Interactive mode in Python was created for testing single commands, and it displays results with repr() because that is more useful for debugging and testing code.
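
As for why repr() escapes this particular character: as far as I know, repr() of a str escapes every character that Python considers non-printable, and the zero-width joiner is a format character (Unicode category Cf), so it falls into that group. A short check with the standard unicodedata module, as a sketch:

```python
import unicodedata

zwj = "\u200d"

name = unicodedata.name(zwj)          # official Unicode name of the character
category = unicodedata.category(zwj)  # two-letter general category
printable = zwj.isprintable()         # False for characters repr() will escape

print(name)
print(category)
print(printable)
print(repr(zwj))
```

The Sinhala letters and vowel signs themselves are printable, which is why only the joiner shows up as an escape.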


So it is NOT a problem with variables and you DON'T have to convert anything.
You only have to call print() explicitly to display values.


A similar issue occurs when printing a list.

To print a list, Python has to convert it to a string, and it also uses repr() on each element to produce a result that is more useful for debugging (or a string that you can later pass to eval() to recreate the list).

If you want to see the list with the strings displayed correctly, then you have to convert it to a string manually:

data = ['ෆ්‍රෙන්ස්!']

text = ",".join(data)

print(text)
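
For comparison, a small sketch that puts both ways of displaying a list side by side. The two-element list and the variable names are just for illustration, and the joiner is again written as a \u200d escape.

```python
data = ["ෆ්\u200dරෙන්ස්!", "ෆ්\u200dරෙන්ස්!"]

as_list = str(data)        # what print(data) writes: each element goes through repr()
joined = ", ".join(data)   # elements are inserted unchanged

print(as_list)   # the joiner is shown as an escape sequence
print(joined)    # the joiner stays a real character, so the text renders normally

# Alternative without building a string first: unpack the list into print().
print(*data, sep=", ")
```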