django countries encoding is not giving correct name

I am using the django_countries module for the list of countries. The problem is that a couple of countries have special characters in their names, like 'Åland Islands' and 'Saint Barthélemy'.

I am calling this method to get the country name:

country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name

I know that country_label is a lazily translated proxy object from django.utils, but it does not give the right name; instead it gives 'Ã…land Islands'. Any suggestions, please?

There are 3 answers

Edwin Lunando On

Just this week I encountered a similar encoding error. I believe the problem is that the machine's locale encoding differs from the one Python expects. Try adding this to your .bashrc or .zshrc:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Then, open up a new terminal and run the Django app again.
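
To check whether the locale change took effect, it can help to print the encodings Python picked up from the environment. This is a minimal sketch assuming Python 3; report_encodings is a hypothetical helper name, not part of Django or django_countries.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python derived from the environment."""
    return {
        # Encoding implied by LC_ALL / LANG for text I/O.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal (may be None if redirected).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Default codec for str.encode() / bytes.decode().
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print("{:>10}: {}".format(name, value))
```

If "preferred" or "stdout" reports something other than UTF-8 (for example ANSI_X3.4-1968 or ISO-8859-1) after opening a new terminal, the exports above were probably not picked up.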

alexisdevarennes On

Try:

from __future__ import unicode_literals  # Must come before any other import in the file.

and/or:

country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('latin1').decode('utf8')
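
The encode/decode round trip above can be tried in isolation on plain strings. This is a sketch assuming Python 3; repair_mojibake is a hypothetical helper, and trying cp1252 as a second codec is an assumption based on the '…' character visible in the question, which Latin-1 cannot encode.

```python
def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were mis-decoded with a single-byte codec.

    Returns the text unchanged when no round trip succeeds.
    """
    for codec in ("latin-1", "cp1252"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# UTF-8 bytes of U+00C5 mis-decoded as Latin-1 (U+00C3 followed by U+0085).
print(repair_mojibake("\u00c3\u0085land Islands"))
# The same bytes mis-decoded as Windows-1252 (U+00C3 followed by U+2026).
print(repair_mojibake("\u00c3\u2026land Islands"))
```

A string that is already correct fails the round trip and comes back untouched, but this only hides the symptom; the mismatched encoding still needs to be found.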
Julien Grégoire On

Django stores a unicode string as a sequence of code points and marks it as unicode for further processing. UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string Django uses has to be encoded from code points to UTF-8 bytes at some point. In the case of Åland Islands, what seems to be happening is that the UTF-8 bytes are being interpreted as code points when the string is converted.

The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the Unicode code point of Å. In UTF-8 byte notation \xc5 becomes \xc3\x85, where \xc3 and \x85 are each one 8-bit byte. See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex
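
The difference between the code point and its UTF-8 bytes can be checked directly. A small sketch, assuming Python 3 (where every str is unicode, so no u'' prefix is needed):

```python
name = "\u00c5land Islands"  # the first character is U+00C5

first = name[0]
code_point = ord(first)             # a single code point
utf8_bytes = first.encode("utf-8")  # two bytes in UTF-8

print(hex(code_point))
print(utf8_bytes)
print(len(name), len(name.encode("utf-8")))
```

The encoded name is one byte longer than the string has characters, because only the first character needs two bytes.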

Or you can use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'.

If you then take each byte and use it as a code point, you get these characters: Ã… See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex

See the code snippet with the HTML notation of these characters.

<div id="test">&#xC3;&#x85;&#xC5;</div>

So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be to encode to UTF-8 in a UTF-8 environment, which converts u'\xc5' to '\xc3\x85', and then decode those bytes to unicode as ISO-8859-1, which gives u'\xc3\x85land Islands'. But since that isn't in the code you've provided, I'm guessing it happens somewhere between the moment you set country_label and the moment your output is displayed incorrectly, either automatically because of encoding settings or through an explicit assignment somewhere.
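
The suspected round trip can be reproduced on a plain string. This is a sketch assuming Python 3; the Windows-1252 variant is an assumption added here, because browsers and many tools treat ISO-8859-1 as Windows-1252, where byte 0x85 is '…' rather than a control character.

```python
original = "\u00c5land Islands"

# Step 1: encode to UTF-8, as a UTF-8 environment would.
encoded = original.encode("utf-8")

# Step 2: decode the same bytes with the wrong single-byte codec.
as_latin1 = encoded.decode("latin-1")  # U+00C3 followed by the control char U+0085
as_cp1252 = encoded.decode("cp1252")   # U+00C3 followed by U+2026 (ellipsis)

print(repr(as_latin1))
print(as_cp1252)
```

The cp1252 result matches the garbled name shown in the question, which suggests the bytes are correct UTF-8 and only the decoding step is wrong.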

FIRST EDIT:

To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your .py file and <meta charset="UTF-8"> in the <head> of your template. To get a unicode string from a lazy proxy object (django.utils.functional.__proxy__) you can call unicode() on Python 2 (str() on Python 3). Like this:

country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)

SECOND EDIT:

Another way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding). Like this:

from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict') 

But since you have already tried many conversions without success, maybe the problem is more complex. Can you share your versions of django_countries and Python, and your Django app's language settings? You can also look directly in your django_countries package (it should be in your Python directory), find the file data.py, and open it to see what it looks like. Maybe the data itself is corrupted.