How do I process a DXL filedata element when the file encoding is "none"?


I'm trying to extract attachments from Domino documents that were exported to DXL (Domino XML schema). For elements with encoding="base64" I can decode the filedata content with ease. However, most of the files have encoding="none", which logically should mean the content is embedded directly, yet the element does not contain readable text. Instead it contains 76-character lines that look very much like base64. They are not valid base64 or uuencoded data, nor anything else I can recognize. Does anyone know what arcane encoding IBM is calling "none"? A typical segment looks like this:

<file hosttype='msdos' compression='none' flags='sign storedindoc' encoding='none' 
name='myfilename.doc' size='50688' storagesize='32519' desiredcompression='huffman'>
<created><datetime dst='true'>20061110T193351,87-02</datetime></created>
<modified><datetime dst='true'>20061110T193351,73-02</datetime></modified><filedata>
0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAXgAAAAAAAAAA
EAAAYAAAAAEAAAD+////AAAAAF0AAAD/////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////

(it goes on for hundreds of lines... up to)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAA
</filedata></file> 

It looks like some MIME encoding, but it is not base64: the number of bits does not add up and the decoder fails. (Yes, I removed the newlines from the parser feed.)

How do I decode something that is supposedly not encoded? (According to the IBM magi.)

[post-script] I realized that the document does not conform to the DXL DTD: it is parseable but does not validate. Also, although encoding="none", the filedata content is in fact base64, though not necessarily padded with '=' characters at the end. On top of that, the XML SAX parser was passing me arbitrary chunks of the text content rather than entire lines. Since base64 decoding operates on groups of 4 characters (each group yielding 3 bytes), chunks whose length was not a multiple of 4 broke the decoding. If I ignore the DTD and force a carefully buffered base64 decoding, even when @encoding != "base64", then all goes well. It looks like IBM does not care about following its own DTDs.
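
For anyone hitting the same problem, the buffered decoding described in the post-script can be sketched roughly as follows. This is a minimal illustration in Python using the standard xml.sax module, not the code I actually ran. The class name FileDataHandler and the helper extract_attachments are made up for this example, and it assumes un-namespaced file and filedata elements as in the segment above, so a real DXL export may need adjusting.

```python
import base64
import xml.sax


class FileDataHandler(xml.sax.ContentHandler):
    """Hypothetical handler: decodes <filedata> text as base64, ignoring @encoding."""

    def __init__(self):
        super().__init__()
        self.files = {}       # attachment name -> decoded bytes
        self._name = None     # 'name' attribute of the current <file>
        self._pending = ""    # base64 characters received but not yet decoded
        self._out = None      # bytearray while inside <filedata>, otherwise None

    def startElement(self, tag, attrs):
        if tag == "file":
            self._name = attrs.get("name")
        elif tag == "filedata":
            self._out = bytearray()
            self._pending = ""

    def characters(self, content):
        if self._out is None:
            return
        # SAX may deliver the text in arbitrary chunks, so strip whitespace,
        # decode only complete 4-character groups and keep the remainder.
        self._pending += "".join(content.split())
        usable = len(self._pending) - len(self._pending) % 4
        self._out += base64.b64decode(self._pending[:usable])
        self._pending = self._pending[usable:]

    def endElement(self, tag):
        if tag == "filedata" and self._out is not None:
            tail = self._pending
            if tail:
                # The export may omit the trailing '=' padding; restore it.
                tail += "=" * (-len(tail) % 4)
                self._out += base64.b64decode(tail)
            self.files[self._name] = bytes(self._out)
            self._out = None
            self._pending = ""


def extract_attachments(dxl_bytes):
    """Parse a DXL fragment (bytes) and return {file name: decoded bytes}."""
    handler = FileDataHandler()
    xml.sax.parseString(dxl_bytes, handler)
    return handler.files
```

Calling extract_attachments(dxl_bytes) on the exported XML returns a dict keyed by the name attribute of each file element. The important part is that the leftover characters are carried over between characters() calls, so no chunk boundary can split a 4-character group.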
