Handle mixed charsets in the same json file

Given I have the following file:

{
  "title": {
    "ger": "Komödie"    (utf8, encoded as c3 b6)
  },
  "files": [
    {
      "filename": "Kom�die"   (latin1, encoded as f6)
    }
  ]
}

(this might look different if you copy and paste it)

This happened due to an application bug; I cannot fix the source that generates these files.

I am now trying to fix the charset of the filename field(s); there can be several of them. I first tried jq (on a single field):

value="$(jq -r '.files[0].filename' <in.txt | iconv -f latin1 -t utf-8)"
jq --arg f "$value" '.files[0].filename = $f' <in.txt

But jq interprets the whole file as UTF-8, and this damages the single f6 byte.

I would like to find a solution in Python, but there too the input is interpreted as UTF-8 by default on Linux. I tried 'ascii', but that does not allow characters >= 128.
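To make the decoding behaviour concrete, here is a minimal sketch of how the single byte f6 fares under the three codecs mentioned so far. The variable names are only illustrative, and the sample bytes are taken from the file shown above.

```python
raw = b'Kom\xf6die'  # "Komödie" with the ö stored as the single latin1 byte f6

# latin1 maps every byte 0x00-0xff to exactly one character, so decoding never fails.
as_latin1 = raw.decode('latin1')

# utf-8 rejects a lone f6 byte; errors='replace' substitutes U+FFFD and the byte is lost.
as_utf8 = raw.decode('utf-8', errors='replace')

# ascii rejects any byte >= 0x80.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A latin1 round trip is lossless, even for bytes that are really UTF-8.
mixed = b'Kom\xc3\xb6die / Kom\xf6die'
round_trip = mixed.decode('latin1').encode('latin1')

print(repr(as_latin1), repr(as_utf8), ascii_ok, round_trip == mixed)
```

The last step is the property that makes working "in the wrong charset" viable: reading as latin1 and writing as latin1 gives back exactly the bytes that were read.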

Now, I think I found a way, but the json serializer escapes all characters. As I (intentionally) work with the wrong character set, the escaped sequence is also garbage.

#!/usr/bin/python3

import json

with open('in.txt', encoding='latin1') as fh:
  j = json.load(fh)

for f in j['files']:
  f['filename'] = f['filename'].encode('utf-8').decode('latin1')   # might be wrong, couldn't test

with open('out.txt', 'w', encoding='latin1') as fh:
  json.dump(j, fh)

What can I do to achieve the expected result, a clean non-escaped utf-8 json file?

There is 1 answer

Answered by Daniel Alder

A workaround I have found so far is not to use JSON at all. Instead, I only process those lines which are known to have the wrong character set. This works because I know the files are very well structured and formatted, so I can easily find the broken lines by static text patterns. Note that I am still working with the wrong charset. But so far, opening UTF-8 as latin1 and saving it as latin1 again has not damaged any of the UTF-8 parts.

#!/usr/bin/python3

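# static prefix of the lines stored as latin1 (assumes three tabs of indentation)
PREFIX = '\t\t\t"filename": "'

# latin1 maps every byte to one character, so nothing is lost while reading
with open('in.txt', encoding='latin1') as fh:
    lines = fh.readlines()

# change the encoding of all filename: lines and keep everything else
lines = [line.encode('utf-8').decode('latin1') if line.startswith(PREFIX) else line
         for line in lines]

# writing as latin1 turns each character back into a single byte
with open('out.txt', 'w', encoding='latin1') as fh:
    fh.writelines(lines)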

output:

{
  "title": {
    "ger": "Komödie"
  },
  "files": [
    {
      "filename": "Komödie"
    }
  ]
}
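
If going through the JSON parser is preferred over matching lines, one possible sketch is to decode the whole file as latin1, parse it, repair every string that was really UTF-8, and then serialize with `ensure_ascii=False` so that nothing is escaped. This rests on several assumptions that are not confirmed by the question: only values under the key `filename` are latin1, every other string is valid UTF-8, object keys are plain ASCII, and the file contains no `\u` escapes. The helper names `repair` and `convert` are hypothetical.

```python
import json

def repair(node, key=None):
    """Recursively fix strings in data parsed from a latin1-decoded file."""
    if isinstance(node, dict):
        # Keys are assumed to be ASCII and are left untouched.
        return {k: repair(v, k) for k, v in node.items()}
    if isinstance(node, list):
        return [repair(v, key) for v in node]
    if isinstance(node, str) and key != 'filename':
        # Really UTF-8 but misread as latin1: restore the bytes, decode properly.
        return node.encode('latin1').decode('utf-8')
    # filename values were genuine latin1 and are already correct; other types pass through.
    return node

def convert(raw: bytes) -> str:
    """Return clean, non-escaped JSON text for a mixed latin1/UTF-8 input."""
    data = json.loads(raw.decode('latin1'))
    return json.dumps(repair(data), ensure_ascii=False, indent=2)

# Sample input with the same byte layout as the file in the question.
sample = b'{"title": {"ger": "Kom\xc3\xb6die"}, "files": [{"filename": "Kom\xf6die"}]}'
print(convert(sample))
```

To process real files, read `in.txt` in binary mode, pass the bytes to `convert`, and write the returned string with `encoding='utf-8'`. Unlike the line-based workaround, this does not depend on indentation, but it will raise a `UnicodeDecodeError` if a string outside `filename` turns out not to be valid UTF-8.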