The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").
Many Java solutions remove it with a simple one-liner:
replace("\uFEFF", "")
I don't understand why this works.
Here is my test code. I check the bytes after calling String#replace
and find that EF BB BF is INDEED removed. See this code run live at IdeOne.com.
It looks like magic. Why does this work?
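import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.junit.Test;

@Test
public void bomIsRemoved() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // Decode as UTF-8; read() returns the number of chars actually filled in.
    int n = new InputStreamReader(new ByteArrayInputStream(b), StandardCharsets.UTF_8).read(c);
    // Use only the chars that were read, so unused array slots do not show up as zero bytes.
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes(StandardCharsets.UTF_8);
    for (byte bt : bytes) { // 61 61 61 in hex: EF BB BF is indeed removed
        System.out.println(bt);
    }
}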
The reason is that a Unicode text should start with the byte order mark (except UTF-8, where it is not mandatory [1]). The BOM is itself a Unicode character, U+FEFF (from Wikipedia).
Which means this special character (\uFEFF) must also be encoded in UTF-8.
UTF-8 can encode Unicode code points in one to four bytes. Code points that fit in 7 bits are encoded in one byte whose highest bit is always zero: 0xxx xxxx. All other code points are encoded in several bytes, and the leading set bits of the first byte give the number of bytes used, e.g. 110x xxxx means the encoding is represented by two bytes. Continuation bytes always start with 10xx xxxx (the x bits can be used for the code points).
The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
A detailed explanation is on Wikipedia.
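The ranges above can be turned into a small lookup. This is only a minimal sketch: the class name Utf8Length and the method bytesNeeded are made up for illustration, and surrogate code points are not treated specially.

```java
// Sketch: how many bytes UTF-8 needs for a given code point.
// Utf8Length and bytesNeeded are hypothetical names, not part of any library.
final class Utf8Length {

    static int bytesNeeded(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) {
            return 1; // 0xxx xxxx
        }
        if (codePoint <= 0x7FF) {
            return 2; // 110x xxxx 10xx xxxx
        }
        if (codePoint <= 0xFFFF) {
            return 3; // 1110 xxxx 10xx xxxx 10xx xxxx
        }
        return 4;     // 1111 0xxx plus three continuation bytes
    }

    public static void main(String[] args) {
        System.out.println(bytesNeeded(0x0061)); // 'a' needs 1 byte
        System.out.println(bytesNeeded(0x00E9)); // U+00E9 needs 2 bytes
        System.out.println(bytesNeeded(0xFEFF)); // the BOM needs 3 bytes
    }
}
```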
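For the BOM we need three bytes, because U+FEFF lies in the range U+0800 - U+FFFF.
To encode the bits in UTF-8, take the 16 bits of FEFF (1111 1110 1111 1111) and fill them into the three-byte pattern 1110 xxxx 10xx xxxx 10xx xxxx. The result is 1110 1111 1011 1011 1011 1111, which is EF BB BF in hex.
EF BB BF sounds already familiar. ;-)
The byte sequence EF BB BF is nothing else than the BOM encoded in UTF-8.
As the byte order mark has no meaning for UTF-8, Java's UTF-8 charset does not use it: it neither writes a BOM when encoding nor strips one when decoding.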
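The three-byte encoding of U+FEFF can also be worked out by hand with a few bit operations. The following is a sketch only: ManualBomEncoding, encodeThreeBytes and toHex are hypothetical names, and the method handles just the three-byte range.

```java
// Sketch: encode a code point from U+0800 - U+FFFF into UTF-8 by hand.
// All names here are made up for illustration.
final class ManualBomEncoding {

    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("outside the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10xx xxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10xx xxxx: low 6 bits
        };
    }

    // Formats bytes as upper-case hex separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeThreeBytes(0xFEFF))); // prints EF BB BF
    }
}
```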
Encoding the BOM character as UTF-8 in Java gives exactly these bytes.
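A quick way to check this is to let Java encode the character itself. This is a minimal sketch; the class BomAsUtf8 and its helper methods are hypothetical names, while String#getBytes(Charset) and StandardCharsets.UTF_8 are standard JDK API.

```java
import java.nio.charset.StandardCharsets;

// Sketch: encode the BOM character with the JDK's UTF-8 encoder.
// BomAsUtf8, utf8 and hex are made-up names for illustration.
final class BomAsUtf8 {

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as upper-case hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The single char U+FEFF becomes three bytes.
        System.out.println(hex(utf8("\uFEFF"))); // prints EF BB BF
        // The UTF-8 encoder does not add a BOM on its own.
        System.out.println(hex(utf8("aaa")));    // prints 61 61 61
    }
}
```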
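Hence when the file is read, the byte sequence EF BB BF gets decoded to the single character \uFEFF, and replace("\uFEFF", "") removes exactly that character.
For other encodings, e.g. UTF-16, Java does add the BOM when encoding.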
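Both directions can be seen in a short sketch. The class BomDecoding and its methods are hypothetical names. The example assumes the JDK's standard UTF-8 and UTF-16 charsets, where the UTF-16 encoder writes big-endian bytes preceded by a BOM.

```java
import java.nio.charset.StandardCharsets;

// Sketch: decoding keeps the BOM as a char, and encoding to UTF-16 adds one.
// BomDecoding, decodeUtf8 and hex are made-up names for illustration.
final class BomDecoding {

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats bytes as upper-case hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'a', 'a'};

        String decoded = decodeUtf8(withBom);
        System.out.println(decoded.length());              // prints 4: the BOM plus "aaa"
        System.out.println(decoded.charAt(0) == '\uFEFF'); // prints true

        String cleaned = decoded.replace("\uFEFF", "");
        System.out.println(hex(cleaned.getBytes(StandardCharsets.UTF_8))); // prints 61 61 61

        // The UTF-16 encoder prepends a BOM.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16)));    // prints FE FF 00 61
    }
}
```

Note that replace removes every U+FEFF in the string, not only a leading one.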
[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf