How to exclude the BOM with BOMInputStream


I am trying to figure out how to exclude the BOM while using the example given by Apache. I read a file from internal storage and first convert it into a String. Then I convert the String into a byte array so that I can get an InputStream. I then check for a BOM with BOMInputStream, because I was getting "Unexpected Tokens" errors. Now I don't know how to exclude the BOM if there is one.

CODE:

StringBuffer fileContent = new StringBuffer("");
String temp = "";
int ch;
try {
    FileInputStream fis = ctx.openFileInput("dataxml");
    try {
        // Braces added so that both statements run for every byte read.
        while ((ch = fis.read()) != -1) {
            fileContent.append((char) ch);
            temp = temp + Character.toString((char) ch);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
}


InputStream ins = new ByteArrayInputStream(temp.getBytes(StandardCharsets.UTF_8));
BOMInputStream bomIn = new BOMInputStream(ins);
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM

}

xpp.setInput(ins,"UTF-8");
parseXMLAndStoreIt(xpp);
ins.close();

The filename is "dataxml"; I write the file from a different class using openFileOutput.


There are 3 answers

3
eric_the_animal On

I've never used BOMInputStream, but to exclude a byte order mark from a stream you just have to start reading at the first byte after the BOM (a UTF-8 BOM is the three bytes EF BB BF). Does BOMInputStream have a property indicating the location of the BOM? Also, have a look here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
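
The "start reading after the BOM" idea can be sketched with the standard library alone. This is only an illustration, not how BOMInputStream is implemented: the class name `BomSkipper` and the method `skipUtf8Bom` are made up for this example, and it handles the UTF-8 BOM only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSkipper {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == UTF8_BOM.length;
        for (int i = 0; hasBom && i < UTF8_BOM.length; i++) {
            hasBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // first byte after the BOM
    }
}
```

The returned stream is the one that would be handed to the parser; a stream without a BOM passes through untouched because the peeked bytes are pushed back.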

0
Alban On

You are building a String by reading characters from an InputStream while disregarding both the BOM and the encoding. The way you read characters from the stream, converting one byte to one character, is bad, very bad. Please use an implementation of Reader (specifying the encoding) to read characters from a sequence of bytes.

Later you convert the String back to bytes (and there you do take care to specify the encoding). If you compare the sequence of bytes you obtain at this point with the one you fetched from your store, they are probably very different.
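
As a rough sketch of the Reader approach, the following decodes the bytes as UTF-8 instead of casting each byte to a char. The class name `ReadAsText` and the method `readUtf8` are invented for this example. Note that Java's UTF-8 decoder does not drop the BOM; it shows up as the character U+FEFF at the start of the text, so the sketch removes it explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadAsText {
    /** Decodes the whole stream as UTF-8 and drops a leading U+FEFF if present. */
    public static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        // The decoder keeps the BOM as the character U+FEFF; remove it here.
        if (sb.length() > 0 && sb.charAt(0) == '\uFEFF') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input: BOM, then XML containing a two-byte UTF-8 character (U+00E9).
        byte[] data = "\uFEFF<name>\u00E9</name>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(data)));
    }
}
```

With the text decoded this way, there is no need to go back to bytes: the String can be given to the parser through a StringReader, since XmlPullParser also has a setInput(Reader) overload.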

2
user2910552 On

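You can wrap the initial stream in a BOMInputStream:

    InputStream stream = new BOMInputStream(inputStream);
    // code using stream goes here (read from stream, not from inputStream)

This way the stream skips the BOM prefix automatically. With the single-argument constructor, only a UTF-8 BOM is detected, and it is excluded from the data that is read. BOMInputStream is part of the Apache Commons IO library.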