.NET DeflateStream fails to decode the content of a PDF stream object


I created a PDF using the export function of Word 2019.

I'm parsing the PDF as text and capturing the content of the stream object:

4 0 obj
<</Filter/FlateDecode/Length 344>>
stream
xœµRMKA½/ÌÈqWè˜dçÊBK?Ä‹hăØu)Ô•¶þL»[¨¶=¨íÂlHæ%yy™¥JPãæÁ ØhuÎéȰ*UòxµJúS•\Øëè`ú¦’
À³F6àM®YnÞ7žBµ–ÚP5.µîX%O]Dg
²]$;((н‚œX7j|‡y÷ƒÏ¡`Ü‹ïòlo‹ßÏÛàˆÚø.Ÿ° /¶oÛ¸ý–GÞÜÖ}†é­J†2ö½J–¿“‰È”£ÕLû2mÅi$I!ûO¿ˆ4š£ka{þ~dO<ƶŸ¦` ¥N”¿
VÕ±èÖÑMù2›×YǤdyzW—g§HV2þÎðb;rÞkov4šg&]­?!ëpº˜‹(µœRŸëx’ÆåÆ–uP8è7)_?²ŽÝ;“yæ"s³Ót’ǹ¿ ¹ÿy
endstream
endobj

I want to decode the Flate-encoded stream, so I wrote this function:

let decodeFlateDecodeStream(encodedStream: string) =
    // Convert the encoded stream from a string to a byte array.
    let encodedBytes = Encoding.UTF8.GetBytes(encodedStream)

    // Create a MemoryStream from the byte array.
    use inputStream = new MemoryStream(encodedBytes)

    // Create a MemoryStream to store the decompressed data.
    use outputStream = new MemoryStream()

    // Create a DeflateStream to decompress the input stream.
    use deflateStream = new DeflateStream(inputStream, CompressionMode.Decompress)

    // Copy the decompressed data to the output stream.
    deflateStream.CopyTo(outputStream)

    // Convert the decompressed stream to a string (assuming it contains text).
    Encoding.UTF8.GetString(outputStream.ToArray())

When I pass the stream content ("xœµRMK.....¹¿ ¹ÿy"), read as a string, to the function, it fails at deflateStream.CopyTo(outputStream) with this error:

System.IO.InvalidDataException: The archive entry was compressed using an unsupported compression method.

The PDF documentation says it is RFC 1951 Flate compression.
Is the .NET DeflateStream not capable of decoding it, or am I doing something wrong?

I found posts from 10 years ago saying that DeflateStream does not work properly and suggesting a third-party library. DotZipLib is too old; I don't think it, or the other library I found, is compatible with .NET 7, and I believe I should be able to use what .NET already provides. I also tried skipping the first 2 bytes, because some posts claim that a 2-byte header is added to the encoded data to make it compatible with RFC 1950 (I haven't found this claimed compatibility in the PDF reference, which says that Flate uses RFC 1951). It still doesn't work.
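
For illustration, here is a minimal sketch of the 2-byte skip applied to raw bytes instead of a re-encoded string. It assumes the dump above was rendered as Windows-1252, in which case the leading "xœ" would be the bytes 0x78 0x9C, a common zlib (RFC 1950) header. The helper name inflateSkippingZlibHeader is hypothetical, and the sketch has not been run against the PDF in question.

```fsharp
open System.IO
open System.IO.Compression
open System.Text

// "œ" is U+0153, so going through a UTF-8 string turns the single
// original byte 0x9C into the two bytes 0xC5 0x93.
let reencoded = Encoding.UTF8.GetBytes "xœ"   // [| 0x78uy; 0xC5uy; 0x93uy |]

// Hypothetical helper: inflate a zlib-wrapped buffer with DeflateStream by
// starting after the 2-byte zlib header. The 4-byte Adler-32 trailer is not
// verified here.
let inflateSkippingZlibHeader (zlibBytes: byte[]) : byte[] =
    use input = new MemoryStream(zlibBytes, 2, zlibBytes.Length - 2)
    use deflate = new DeflateStream(input, CompressionMode.Decompress)
    use output = new MemoryStream()
    deflate.CopyTo(output)
    output.ToArray()
```

On .NET 6 and later, System.IO.Compression.ZLibStream reads the zlib wrapper directly, so the manual skip may not be needed there. Either way the input has to be the original bytes: once the data has passed through Encoding.UTF8, skipping 2 bytes no longer lands on the start of the deflate data.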

[UPDATE]
I believe the error is in the initial parse of the PDF itself.
I'm reading it as text with: let lines = File.ReadAllLines pdfFile and then concatenating the lines between the "stream" and "endstream" markers. I realized that I was losing 3 newline characters by reading line by line. I added an extra "\n" character between the concatenated lines of text, but the resulting encoded byte array is still only 334 bytes long instead of 344.

I think I have to read the PDF as a binary file, to avoid the UTF-8 re-encoding issue mentioned in a comment?
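
A rough sketch of what reading the file as binary could look like. The helpers indexOf and extractStream are hypothetical names. The sketch assumes the first "stream" keyword in the file belongs to the wanted object, and it takes the byte count from the /Length entry (344 here) as a parameter instead of parsing the dictionary.

```fsharp
open System.IO
open System.Text

// Hypothetical helper: index of the first occurrence of needle in haystack
// at or after start, or -1 if there is none.
let indexOf (needle: byte[]) (start: int) (haystack: byte[]) : int =
    let last = haystack.Length - needle.Length
    let matchesAt i =
        let mutable j = 0
        while j < needle.Length && haystack.[i + j] = needle.[j] do
            j <- j + 1
        j = needle.Length
    let rec go i =
        if i > last then -1
        elif matchesAt i then i
        else go (i + 1)
    go start

// Hypothetical helper: return exactly `length` raw bytes following the first
// "stream" keyword and its end-of-line marker (CRLF or LF).
let extractStream (pdf: byte[]) (length: int) : byte[] =
    let keyword = Encoding.ASCII.GetBytes "stream"
    let pos = indexOf keyword 0 pdf
    if pos < 0 then failwith "no stream keyword found"
    let mutable dataStart = pos + keyword.Length
    if pdf.[dataStart] = 13uy then dataStart <- dataStart + 1   // CR
    if pdf.[dataStart] = 10uy then dataStart <- dataStart + 1   // LF
    Array.sub pdf dataStart length
```

A call such as extractStream (File.ReadAllBytes pdfFile) 344 would then return the compressed bytes untouched, newline bytes inside the data included, ready to be wrapped in a MemoryStream for decompression without any string conversion in between.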
