How to convert the '¿' special character in Unix


I have a file, file.dat, which contains this line: CNBC: America¿s Gun: The Rise of the AR–15

Unfortunately, it contains some special characters that were not converted properly by iconv on Unix.

$ file -bi file.dat
text/plain; charset=utf-8
$ cat file.dat | cut -c14 | od -x
0000000 bfc2 000a
0000003

Can you please help me convert the special character?

Thanks in advance

-Praveen


There is 1 answer

Answered by tripleee

Your file is basically fine: it is valid UTF-8, and the character you are looking at is an INVERTED QUESTION MARK (U+00BF). You seem to be using some legacy 8-bit character set to view the file, though. Also, od -x prints 16-bit words, and on a little-endian machine each pair of bytes comes out swapped, so the hex looks backwards. The actual byte sequence is 0xC2 0xBF, not the other way around; od -t x1 would show the bytes in file order.
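If you want to convince yourself of the byte-order point, here is a small sketch in Python (my choice of language, not something from the question). It encodes U+00BF plus a newline as UTF-8, then reinterprets the bytes as little-endian 16-bit words, which is roughly what od -x does on a little-endian machine. The variable names are just for illustration.

```python
import struct

# U+00BF INVERTED QUESTION MARK is two bytes in UTF-8: C2 BF.
# The trailing newline mirrors what cut(1) emits after the character.
data = "\u00bf\n".encode("utf-8")
byte_view = " ".join(f"{b:02x}" for b in data)

# od -x groups the input into 16-bit words in host byte order. On a
# little-endian machine the second byte of each pair is printed first.
# An odd trailing byte is padded with a zero byte to fill the last word.
padded = data + b"\x00" * (len(data) % 2)
words = struct.unpack(f"<{len(padded) // 2}H", padded)
word_view = " ".join(f"{w:04x}" for w in words)

print(byte_view)  # c2 bf 0a
print(word_view)  # bfc2 000a
```

The second line of output matches the od -x dump in the question, which confirms the file really contains C2 BF followed by a newline.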

This article explains that when Oracle tries to export to an unknown character set, it replaces characters it cannot convert with upside-down question marks, so I guess that's what happened here. The only proper fix is to go back to your Oracle database and export in a format where curly apostrophes are representable (which is what I imagine the character really should be).

If the file came from somebody else's Oracle database, ask them to do the export again, or ask them what the character should be, or ignore the problem, or guess what character to put there and fix it in your editor. If there are just a few problem characters, just do it manually. If there are lots, maybe you can use context-sensitive substitution rules like

it¿s => it’s
dog¿s => dog’s
¿problem¿ => ‘‘problem’’
na¿ve => naïve
¿yri¿ispy¿rykk¿ => äyriäispyörykkä (obviously!)
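Rules like these could be sketched as a list of regular expressions, for example in Python. Everything here is hypothetical: the RULES list, the apply_rules name, and the two patterns are guesses I made up for illustration, and each one only encodes an assumption about what the lost character was. Review the output rather than trusting it.

```python
import re

# Hypothetical, hand-written rules. Each is a guess about the lost character.
RULES = [
    # A letter, then U+00BF, then "s" at the end of a word: most likely a
    # possessive or contraction, so assume a right single quotation mark.
    (re.compile(r"(?<=\w)\u00bf(?=s\b)"), "\u2019"),
    # A specific word we happen to recognise.
    (re.compile(r"\bna\u00bfve\b"), "na\u00efve"),
]


def apply_rules(text: str) -> str:
    """Apply each rule in order; unmatched U+00BF characters are left alone."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "CNBC: America\u00bfs Gun, a na\u00bfve view, and \u00bfQu\u00e9?"
fixed = apply_rules(sample)
# fixed now reads: CNBC: America’s Gun, a naïve view, and ¿Qué?
```

Note that the legitimate Spanish ¿ at the start of a word is not touched, because neither rule matches it. That is the point of making the rules context-sensitive.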

Using ¿ as a placeholder for "I don't know" is problematic, because ¿ is also a legitimate character (in Spanish, for example) and the two uses cannot be told apart afterwards. Unicode actually has a solution: the REPLACEMENT CHARACTER (U+FFFD). I guess you're not going to like this, but the only valid (context-free) replacement you can perform programmatically is s/\x{00BF}/\x{FFFD}/g (this is Perl syntax and assumes the input is decoded as UTF-8, but use whatever you like).
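The same context-free substitution, as a minimal Python sketch. The function names and the idea of writing to a separate output file are my assumptions, not anything from the question. Be aware that it replaces every U+00BF, including any that were genuinely meant to be inverted question marks.

```python
INVERTED_QUESTION_MARK = "\u00bf"
REPLACEMENT_CHARACTER = "\ufffd"


def mark_unknown(text: str) -> str:
    """Replace every U+00BF with U+FFFD, without guessing the original."""
    return text.replace(INVERTED_QUESTION_MARK, REPLACEMENT_CHARACTER)


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8 and write a copy to dst with the substitution applied."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(mark_unknown(line))


marked = mark_unknown("CNBC: America\u00bfs Gun")
# marked now contains U+FFFD where the U+00BF used to be.
```

You would call something like convert_file("file.dat", "file.fixed.dat"), where the output name is only an example.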