Printable characters in Mac vs Linux


If I do this from command line on my Mac (UTF-8 in terminal and so is the file):

tr -cd '[:print:]\n' < infile > outfile

I get a different result in the outfile than when I run the same command on a Linux system (UTF-8 in terminal and so is the file).

What can be the reason for this?

This is a sample character that is still there when running the command on Mac: š (the character is an extended ASCII character 0x9A/s with caron). The same character is removed when running the command on Linux.


There are 2 answers

tripleee On

If the remaining byte is 0x9A, the file is not proper UTF-8, and neither the tool you are using to view it (0x9A is š in e.g. Windows code page 1252) nor, apparently, your tr is treating it as UTF-8.
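A minimal Python sketch of that point, not taken from the question: it decodes the single byte 0x9A as code page 1252, checks whether it is acceptable as UTF-8 on its own, and shows the bytes that UTF-8 actually uses for š. The variable names are illustrative only.

```python
raw = b"\x9a"

# Windows code page 1252 maps this byte to U+0161 (s with caron).
as_cp1252 = raw.decode("cp1252")
print(repr(as_cp1252))

# A lone 0x9A is a continuation byte, so strict UTF-8 decoding rejects it.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)

# In real UTF-8 the same character is a two-byte sequence.
utf8_bytes = "\u0161".encode("utf-8")
print(utf8_bytes.hex(" "))
```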

To properly solve your problem, we would need to see a fragment of the actual bytes in the file. For example, a file displaying as åäö could contain the bytes

0xE5 0xE4 0xF6

if it's in ISO-8859-1 (which coincides with CP1252 in these positions) or

0xC3 0xA5 0xC3 0xA4 0xC3 0xB6

if it is proper UTF-8. On OS X, an old file could also plausibly be in Mac Roman, which would encode this string as

0x8C 0x8A 0x9A

as well as, of course, a large number of other encodings, depending on the file's provenance.
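One way to check which of these byte sequences a file matches is to encode the test string yourself. The sketch below assumes Python's built-in `latin-1`, `utf-8` and `mac_roman` codecs; the dictionary name `encodings` is only for illustration.

```python
text = "\u00e5\u00e4\u00f6"  # the string "åäö"

# Encode the same three characters with each candidate encoding.
encodings = {
    codec: text.encode(codec)
    for codec in ("latin-1", "utf-8", "mac_roman")
}

for codec, encoded in encodings.items():
    # Print the bytes in hex so they can be compared with a hex dump of the file.
    print(f"{codec:10} {encoded.hex(' ')}")
```

Comparing this output with a hex dump of the first few non-ASCII bytes of the file is usually enough to narrow down the encoding.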

b4hand On

Unfortunately, as Karol C has shown below in the tr source, tr on Linux does not support Unicode at all. It processes its input one byte at a time, so it will not behave correctly on a UTF-8 file that contains any multibyte sequences.
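If a byte-oriented tr is the obstacle, one workaround is to do the filtering on decoded characters instead of bytes. This is a rough sketch rather than a drop-in replacement: `keep_printable` and `filter_file` are hypothetical helper names, and Python's `str.isprintable()` follows Unicode character categories, which is not guaranteed to match any particular locale's `[:print:]` class.

```python
def keep_printable(text: str) -> str:
    """Drop every character that is neither printable nor a newline."""
    return "".join(ch for ch in text if ch.isprintable() or ch == "\n")


def filter_file(src: str, dst: str) -> None:
    """Copy src to dst, keeping only printable characters and newlines.

    Both files are treated as UTF-8; input that is not valid UTF-8
    raises UnicodeDecodeError instead of being silently mangled.
    """
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(keep_printable(line))


if __name__ == "__main__":
    # BEL and the C1 control U+009A are dropped; U+0161 and the newline stay.
    sample = "a\x07\u0161\n\u009ab"
    print(repr(keep_printable(sample)))
```

Because the input is decoded first, a properly encoded š is seen as one character and kept, while a C1 control character such as U+009A is dropped.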

According to this database, the U+009A character is a control character and not a printable character. The name of the character is "SINGLE CHARACTER INTRODUCER". It appears that the glyph as rendered on that page visually matches the description that you've provided, but that is not how it is being displayed on Linux. However, there is a different character that is "s with a caron". Unicode can be complicated.

According to Wikipedia, the "š" (s with caron) character is actually U+0161 for the lower case and U+0160 for the capital.
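The same properties can be looked up locally with Python's standard `unicodedata` module, as a quick cross-check of the code points mentioned above. Control characters have no formal name in that module, so the sketch passes a fallback string for U+009A.

```python
import unicodedata

for cp in (0x009A, 0x0160, 0x0161):
    ch = chr(cp)
    # General category: "Cc" is a control character, "Lu"/"Ll" are letters.
    category = unicodedata.category(ch)
    # name() has no entry for control characters, so supply a default.
    name = unicodedata.name(ch, "<no formal name>")
    print(f"U+{cp:04X}  {category}  printable={ch.isprintable()}  {name}")
```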

This also aligns with this database: