I'm trying to get a Python program to read a word that another Python program encoded as UTF-8 and saved to a txt file.

For example, the string it gets might be:

b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'

but it arrives as a normal string, as if I had written:

word_string = "b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'"

How do I make the script treat this as a bytes string and not a normal string? I know it can be written directly as

word_bytes = b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'

but if the content of that 'word_bytes' variable is already written in a file, how can I read it back and make the program understand that it only has to decode it? When I try to decode it, Python says it's a string and can't be decoded. Any help?

Thanks in advance!

UPDATE: For anyone who reads the string from a file, at least on Windows (I'm using Windows 7): the text read from the file contains literal backslashes, so with tripleee's answer alone the backslashes end up doubled after encoding. Decoding with 'unicode_escape' first removes the extra backslash and turns each \xNN back into a single character. So the way to get the word from a file and decode it is the following:

s = r'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'.encode().decode('unicode_escape') [the raw string stands for the text between the '' as read from the file with open(file,"r"), in my case]
s.encode('latin-1').decode('utf-8') [or ISO-8859-1, which is another name for the same encoding]
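
The two steps above can be wrapped in one function. This is only a sketch: `decode_escaped_utf8` is a hypothetical name, and it assumes the text read from the file contains nothing but ASCII characters and `\xNN` escapes, without the surrounding `b'...'`.

```python
def decode_escaped_utf8(text):
    # Step 1: interpret the literal \xNN escapes. Each escape becomes one
    # character whose code point equals the byte value NN.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # Step 2: map every character back to the byte with the same value
    # (that is what Latin-1 does), then decode those bytes as UTF-8.
    return unescaped.encode("latin-1").decode("utf-8")

# Raw string: stands in for what open(file, "r").read() would return.
raw = r"\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc"
word = decode_escaped_utf8(raw)  # 'форум'
```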

EDIT: tripleee's answer is almost what I wanted to know (about half is still missing), but it is already a working approach, so thank you! But how could I do it without knowing the encoding? In this case I didn't know the encoding was latin-1, and I can't try every encoding. I'd like something equivalent to putting a b in front of the string, as in the 'word_bytes' variable (maybe that picks the right encoding automatically?), but as a function applied to a variable that already holds the bytes part.
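
On the "not knowing the encoding" point, a short check may help. Latin-1 is not a guess about how the file was encoded: it maps code points 0 to 255 onto the byte with the same value, so it works for any `\xNN` sequence, and the `b` prefix on a literal does the same thing without choosing a text encoding. The only encoding that really has to be known is the one the word was originally encoded with (UTF-8 here). The variable names below are illustrative only.

```python
# Every possible byte value survives a Latin-1 round trip unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip_ok = as_text.encode("latin-1") == all_bytes
same_values = [ord(ch) for ch in as_text] == list(range(256))

# A bytes literal and the Latin-1 encoding of the same escapes are identical.
literal = b"\xd1\x84"
from_text = "\xd1\x84".encode("latin-1")
```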

2 Answers

tripleee On Best Solutions

If you have the bytes in a variable already, you are all set. If you have the bytes in a string, I'm assuming you basically have a sequence of characters where the code point value of each is equivalent to the byte value it's supposed to hold. This happens to be the definition of the Latin-1 encoding - it feels a bit dirty, but the trick is to encode your string as Latin-1, then decode back as UTF-8.

>>> s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
>>> s.encode('latin-1').decode('utf-8')
'форум'
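
If the file holds the complete printed form, including the `b'...'` wrapper shown in the question, one option is to parse it back into a real bytes object with `ast.literal_eval` from the standard library. The following is a minimal sketch under that assumption; `parse_bytes_literal` is a hypothetical helper name, and `file_text` stands in for whatever was read from the file.

```python
import ast

def parse_bytes_literal(text):
    # literal_eval only evaluates Python literals, so it does not run
    # arbitrary code the way eval() would.
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal such as b'...'")
    return value

# Raw string: the characters as they would appear in the text file.
file_text = r"b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'"
word = parse_bytes_literal(file_text).decode("utf-8")  # 'форум'
```

This avoids the Latin-1 step entirely, but it raises an error if the text is not a valid Python literal, so it is less forgiving than the encode/decode trick.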
abssab On

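You can check whether a value is a bytes string or a normal string with isinstance. In Python 3 the two types are bytes and str (the Python 2 unicode type no longer exists):

def identifystring(value):
    # bytes holds raw byte values and has .decode()
    if isinstance(value, bytes):
        print("bytes string")
    # str holds text (Unicode) and has .encode()
    elif isinstance(value, str):
        print("ordinary string")
    else:
        print("not a string")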