Using `xmllint` on UTF-16 (little-endian) XML


I am working on a binary file, from which I can extract what seems to be a UTF-16 (little-endian) XML file.

If I extract the data, and try to dump it from the console (running debian/jessie amd64), here is what I get:

$ xmllint --format D5905822-DFF9-7944-9CFE-258264B8162E.UNK
D5905822-DFF9-7944-9CFE-258264B8162E.UNK:1: parser error : Char 0x0 out of allowed range
<
 ^
D5905822-DFF9-7944-9CFE-258264B8162E.UNK:1: parser error : StartTag: invalid element name
<
 ^

I could not find anything in the xmllint man page to help me out, so I downloaded a sample UTF-16 little-endian file from the net. I removed the actual XML data, keeping only the first line (the byte-order mark and the XML declaration with the encoding):

$ cat header
��<?xml version="1.0" encoding="UTF-16"?>

$ hexdump header
0000000 feff 003c 003f 0078 006d 006c 0020 0076
0000010 0065 0072 0073 0069 006f 006e 003d 0022
0000020 0031 002e 0030 0022 0020 0065 006e 0063
0000030 006f 0064 0069 006e 0067 003d 0022 0055
0000040 0054 0046 002d 0031 0036 0022 003f 003e
0000050 000d 000a                              
0000054

And now I can use xmllint properly:

$ cat header D5905822-DFF9-7944-9CFE-258264B8162E.UNK > bla.xml
$ xmllint --format bla.xml
��<?xml version="1.0" encoding="UTF-16"?>
<InteractiveMeasurement>
  <InteractiveMeasurementRecord ElementUniqueName="f0c9b1c6-9a5c-40cd-8303-e507bb539cdc" IsValid="true">
[...]

Isn't there an easier solution? Why is it so complicated to read UTF-16 little-endian XML files?

There is 1 answer

Pekka On BEST ANSWER

The "Encodings support" page of the libxml2 documentation (the XML C parser and toolkit of Gnome) indicates that this behaviour is by design, and the author questions why anyone would want anything else. xmllint provides an option for the output encoding (--encode) but none for the input encoding.

It looks like it would be possible to extend the parser with a further encoding, but this may not get past the default detection heuristics.
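
Since xmllint has no input-encoding option, one possible workaround is to fix up the byte stream before the parser sees it. The following is a minimal sketch, not a tested recipe for this particular file: the function names are hypothetical, and it assumes the extracted data really is UTF-16LE with neither a byte-order mark nor an XML declaration (which is what the question's output suggests).

```shell
# Hypothetical helper: write a UTF-16LE byte-order mark (bytes FF FE),
# followed by the file's contents, to stdout.
# Assumption: the file is UTF-16LE and does not already start with a BOM.
prepend_utf16le_bom() {
    printf '\377\376'   # octal escapes for 0xFF 0xFE
    cat "$1"
}

# Hypothetical helper: transcode the file from UTF-16LE to UTF-8 on stdout.
# Assumption: the data carries no XML declaration naming a different encoding.
utf16le_to_utf8() {
    iconv -f UTF-16LE -t UTF-8 < "$1"
}
```

Either function can then be piped into xmllint, which reads standard input when given - as the file name:

$ prepend_utf16le_bom D5905822-DFF9-7944-9CFE-258264B8162E.UNK | xmllint --format -
$ utf16le_to_utf8 D5905822-DFF9-7944-9CFE-258264B8162E.UNK | xmllint --format -

The first variant only adds the two BOM bytes instead of a whole borrowed header line; the second sidesteps UTF-16 entirely, at the cost of depending on iconv.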