Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use

PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");

or

PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");

I was reading All about character sets to determine the character set of an example file, because I must create a file in the same encoding from Java code.

When my example file contains "European" letters (Norwegian: å ø æ), the following command tells me the file encoding is "iso-8859-1":

file -bi example.txt

However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".

file -bi example-no-european-letters.txt

What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?

Should I just use the charset "ISO-8859-1" and everything will be ok?

There are 2 answers

Kayaman On BEST ANSWER

If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was intended. It may be just a coincidence that there were no characters that would require a different encoding.
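To make the point above concrete, here is a minimal sketch of a "is this data pure 7-bit ASCII?" check. The class and method names (AsciiCheck, isPureAscii) are made up for illustration, and this is only a plausible reading of why a content-based guess flips from iso-8859-1 to us-ascii; it is not a description of how the file command is implemented.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // Hypothetical helper: true when every byte is in the 7-bit range 0x00-0x7F.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            // Java bytes are signed, so values 0x80-0xFF show up as negative numbers.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00f8 is the letter ø; the escape keeps this source file itself pure ASCII.
        byte[] plain = "Bjorn".getBytes(StandardCharsets.ISO_8859_1);
        byte[] norwegian = "Bj\u00f8rn".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isPureAscii(plain));     // true
        System.out.println(isPureAscii(norwegian)); // false
    }
}
```

A true result only says the bytes are compatible with US-ASCII; it says nothing about which encoding the producer of the file had in mind.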

ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters. Its first 128 code points (0 to 127) are the same as in US-ASCII, as is often the case for convenience reasons.

However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå characters as two bytes instead of ISO-8859-1's one byte.
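The one-byte versus two-byte difference can be seen directly by encoding the same string with each charset. This is a small sketch; the class and method names (EncodedLengths, encodedLength) are invented for the example, while String.getBytes(Charset) and StandardCharsets are standard JDK APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String plain = "Bjorn";
        String norwegian = "Bj\u00f8rn"; // "Bjørn"

        // Pure ASCII text is byte-for-byte identical in all three charsets.
        System.out.println(encodedLength(plain, StandardCharsets.US_ASCII));   // 5
        System.out.println(encodedLength(plain, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(encodedLength(plain, StandardCharsets.UTF_8));      // 5

        // ø is one byte (0xF8) in ISO-8859-1 but two bytes (0xC3 0xB8) in UTF-8.
        System.out.println(encodedLength(norwegian, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(encodedLength(norwegian, StandardCharsets.UTF_8));      // 6

        // US-ASCII cannot represent ø; getBytes silently substitutes '?'.
        byte[] ascii = norwegian.getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) ascii[2]); // ?
    }
}
```

Note the last case: picking US-ASCII does not fail, it quietly loses data, which is one more reason not to guess.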

TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
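If you do have to fall back on observing the data, one possible approach is to check which candidate charsets can decode the bytes without errors. The sketch below assumes a helper named decodesCleanly (hypothetical) built on the standard CharsetDecoder with CodingErrorAction.REPORT, so that invalid input raises an exception instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    // Hypothetical helper: true when the bytes decode in the charset without any error.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Bjørn" as it would be stored in an ISO-8859-1 file (ø is the single byte 0xF8).
        byte[] latin1Bytes = "Bj\u00f8rn".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodesCleanly(latin1Bytes, StandardCharsets.US_ASCII));   // false
        System.out.println(decodesCleanly(latin1Bytes, StandardCharsets.UTF_8));      // false
        System.out.println(decodesCleanly(latin1Bytes, StandardCharsets.ISO_8859_1)); // true
    }
}
```

This can only rule charsets out. ISO-8859-1 assigns a character to every byte value, so it decodes any input "cleanly", including files that were really written in some other 8-bit encoding.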

Kaliappan On

It depends on which characters the document uses. ASCII is a 7-bit charset, and ISO-8859-1 is an 8-bit charset that supports some additional characters. In most cases, if you are going to reproduce the document from an InputStream, I recommend the ISO-8859-1 charset. It will work for text files opened in programs such as Notepad and MS Word.

If you are using other international characters, you need to check for a charset that supports those particular characters, such as UTF-8.
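As a rough illustration of the two PrintWriter choices from the question, the sketch below writes the same text with each charset name and inspects the resulting bytes. The class and method names (WriteCharsetDemo, writeAndReadBack) and the temp-file prefix are invented for the example; PrintWriter(File, String) is the standard constructor that takes a charset name.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteCharsetDemo {
    // Hypothetical helper: writes text to a temp file in the named charset and returns the raw bytes.
    static byte[] writeAndReadBack(String text, String charsetName) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                try (PrintWriter pw = new PrintWriter(tmp.toFile(), charsetName)) {
                    pw.print(text); // print, not println, so no line separator is appended
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String norwegian = "Bj\u00f8rn"; // "Bjørn"

        byte[] latin1 = writeAndReadBack(norwegian, "ISO-8859-1");
        byte[] ascii = writeAndReadBack(norwegian, "US-ASCII");

        System.out.println(Integer.toHexString(latin1[2] & 0xFF)); // f8
        System.out.println((char) ascii[2]);                       // ?
    }
}
```

With "ISO-8859-1" the ø survives as the byte 0xF8, while with "US-ASCII" it is written as a question mark. For text that happens to contain only ASCII characters, both choices produce identical files.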