The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").
Many Java solutions remove it with a simple one-liner:
replace("\uFEFF", "")
I don't understand why this works.
Here is my test code. I check the bytes after calling String#replace and find that EF BB BF is indeed removed. See this code run live at IdeOne.com.
It seems like magic. Why does this work?
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import org.junit.Test;

@Test
public void removesBom() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // read() returns the number of chars decoded; the unused tail of c stays zero-filled
    int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes();
    for (byte bt : bytes) { // 61 61 61, we can see EF BB BF is indeed removed
        System.out.println(bt);
    }
}
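InputStreamReader decodes the UTF-8 encoded byte sequence (b) into Java chars, which are UTF-16 code units. The bytes EF BB BF are the UTF-8 encoding of the code point U+FEFF, and Java's UTF-8 decoder does not strip it, so it ends up in the char array as the single char \uFEFF (the same code point that is written as FE FF when used as a UTF-16BE BOM). That is why replace("\uFEFF", "") matches it. The Charset documentation describes how the standard charsets handle byte order marks: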
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
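To make the decoding step concrete, here is a minimal sketch that decodes only the three BOM bytes and inspects the result. The class name BomDecodeDemo and the helper decodeUtf8 are made up for this illustration; it uses the String(byte[], Charset) constructor rather than InputStreamReader, on the assumption that both go through the same UTF-8 decoder.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8 into a String.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        String s = decodeUtf8(bom);

        // The three bytes decode to a single char, U+FEFF.
        System.out.println(s.length());                       // 1
        System.out.println(Integer.toHexString(s.charAt(0))); // feff

        // Encoding the same char as UTF-16BE yields the two bytes FE FF.
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
            System.out.println(Integer.toHexString(b & 0xFF)); // fe, then ff
        }
    }
}
```

So the BOM is not special-cased at all here: it is an ordinary character in the decoded text, and replace removes it like any other character.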
See JLS 3.1 to understand why the internal encoding of String is UTF-16:
https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1
String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.
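Because the result of getBytes() depends on the platform default, it may be safer to name the charset explicitly. The following is only a sketch; the class name ExplicitCharsetDemo and the helper utf8BytesWithoutBom are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    // Hypothetical helper: removes every U+FEFF char, then encodes as UTF-8
    // regardless of the platform default charset.
    static byte[] utf8BytesWithoutBom(String s) {
        return s.replace("\uFEFF", "").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = utf8BytesWithoutBom("\uFEFFaaa");
        System.out.println(Arrays.toString(bytes)); // [97, 97, 97]
    }
}
```

With the charset spelled out, the test prints 61 61 61 (97 in decimal) on any machine, not only on one whose default happens to be UTF-8.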
Summary
The sequence EF BB BF (UTF-8 BOM) is decoded to the single char \uFEFF (which would be written as FE FF in a UTF-16BE BOM) when InputStreamReader decodes the byte sequence, because a java.lang.String holds UTF-16 code units. After removing that char with replace and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform) and you see your original byte sequence without the BOM.