If I do this from command line on my Mac (UTF-8 in terminal and so is the file):
tr -cd '[:print:]\n' < infile > outfile
I get a different result in the outfile than when I run the same command on a Linux system (UTF-8 in terminal and so is the file).
What can be the reason for this?
This is a sample character that is still there when running the command on Mac: š (the character is an extended ASCII character 0x9A/s with caron). The same character is removed when running the command on Linux.
If the remaining byte is 0x9A, the file is not proper UTF-8, and neither is the tool you are using to view it (0x9A is š in, e.g., Windows code page 1252), nor, apparently, your tr.
To properly solve your problem, we would need to see a fragment of the actual bytes in the file. For example, a file displaying as åäö could contain the bytes 0xE5 0xE4 0xF6 if it's in ISO-8859-1 (which coincides with CP1252 in these positions), or 0xC3 0xA5 0xC3 0xA4 0xC3 0xB6 if it was proper UTF-8. On OSX, an old file could also plausibly be in Mac Roman, which would encode this string as 0x8C 0x8A 0x9A. A large number of other encodings are possible as well, depending on the file's provenance.
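To see how the same visible text maps to different bytes, and how a lone 0x9A byte reads under different encodings, here is a minimal Python sketch. The helper names (hex_bytes, encode_table, is_valid_utf8) are made up for illustration; the codec names latin-1, utf-8, cp1252 and mac_roman are Python's standard ones. It only demonstrates the encoding side and does not claim to reproduce what either tr implementation does internally.

```python
# Sketch: compare the byte representation of a string across encodings.
# Helper names are illustrative assumptions, not part of any existing tool.
SAMPLE = "åäö"
ENCODINGS = ["latin-1", "utf-8", "mac_roman"]


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encode_table(text: str, encodings=ENCODINGS) -> dict:
    """Map each encoding name to the hex dump of text in that encoding."""
    return {name: hex_bytes(text.encode(name)) for name in encodings}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name, hexed in encode_table(SAMPLE).items():
        print(f"{name:10} {hexed}")

    lone = b"\x9a"
    # A lone 0x9A byte is a UTF-8 continuation byte with no lead byte.
    print("valid UTF-8:", is_valid_utf8(lone))
    # The same byte is a different letter depending on the legacy encoding.
    print("cp1252:", lone.decode("cp1252"), "mac_roman:", lone.decode("mac_roman"))
```

To check a real file, read its first bytes with open("infile", "rb").read(64) and pass them to hex_bytes and is_valid_utf8. If the file really is UTF-8, š should appear as the two bytes C5 A1 rather than a single 9A.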