Empty PdfString decodes into a string of 64 ï characters


https://github.com/itext/itext7/blob/develop/CONTRIBUTING.md says this is the place to report itext7 bugs, so here it is.

Observable behaviour

Using itext7 version 8.0.0

My PDF document includes several instances of /Span << /ActualText <> >>

<> is valid syntax for a hex-encoded, zero-length PdfString; however, it is parsed into a PdfString instance with content of 128 zero bytes, hexWriting=true, and value of an empty string.

For this instance, GetValue() correctly returns the empty string, but ToUnicodeString() essentially calls PdfTokenizer.DecodeStringContent(new byte[128], true), which returns a byte[64] with every element set to 239 (0xEF). That array is then converted into a string of 64 ï characters, which is what ToUnicodeString() returns.
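The last step can be illustrated in isolation. This is only a sketch: it assumes the byte-to-character conversion maps each byte to the code point of the same value, which holds for 0xEF in Latin-1. The ByteToCharDemo class and its Widen helper are hypothetical names and are not part of iText.

```csharp
using System;
using System.Linq;

static class ByteToCharDemo {
    // Hypothetical helper: maps each byte to the character with the same
    // code point (Latin-1 style widening). Not an iText API.
    public static string Widen(byte[] bytes) {
        return new string(bytes.Select(b => (char)b).ToArray());
    }

    static void Main() {
        // 64 bytes of 239, as described for the decoded content above
        byte[] decoded = Enumerable.Repeat((byte)239, 64).ToArray();
        string text = Widen(decoded);

        Console.WriteLine(text.Length);    // 64
        Console.WriteLine((int)text[0]);   // 239, i.e. U+00EF (ï)
    }
}
```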

Because CanvasTag.GetActualText calls ToUnicodeString(), it ends up with the garbled 64-character string instead of the empty string.

Reproducer

Using the C# Immediate Window:

var str = new iText.Kernel.Pdf.PdfString("");
Expression has been evaluated and has no value
str.SetHexWriting(true);
{}
    content: null
    decryptInfoGen: 0
    decryptInfoNum: 0
    decryption: null
    directOnly: false
    encoding: null
    hexWriting: true
    indirectReference: null
    state: 0
    value: ""
str.ToUnicodeString()
"ïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïïï"

Root cause

In https://github.com/itext/itext7-dotnet/blob/develop/itext/itext.kernel/itext/kernel/pdf/PdfString.cs#L326 :

        protected internal virtual byte[] EncodeBytes(byte[] bytes) {
            if (hexWriting) {
                ByteBuffer buf = new ByteBuffer(bytes.Length * 2);
                foreach (byte b in bytes) {
                    buf.AppendHex(b);
                }
                return buf.GetInternalBuffer();
            }

This creates, for a zero-length PdfString, a new ByteBuffer(0).

In https://github.com/itext/itext7-dotnet/blob/develop/itext/itext.io/itext/io/source/ByteBuffer.cs#L39 :

        public ByteBuffer(int size) {
            if (size < 1) {
                size = 128;
            }
            buffer = new byte[size];
        }

This means that, for a zero-length PdfString, the created buffer is a byte[128].
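The interaction between the two snippets can be reproduced without iText. The following is a minimal model, not the library code: MiniByteBuffer and EncodeModel are hypothetical stand-ins that copy only the constructor fallback and the hex loop quoted above, and the buffer has no growth logic.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for iText's ByteBuffer, reduced to what is quoted above.
class MiniByteBuffer {
    private static readonly byte[] HexDigits =
        Encoding.ASCII.GetBytes("0123456789abcdef");
    private readonly byte[] buffer;
    private int count;

    public MiniByteBuffer(int size) {
        if (size < 1) {
            size = 128; // same fallback as the quoted constructor
        }
        buffer = new byte[size];
    }

    // Writes two hex digits per byte. Sketch only: no growth logic,
    // the callers below never exceed the initial capacity.
    public void AppendHex(byte b) {
        buffer[count++] = HexDigits[(b >> 4) & 0x0f];
        buffer[count++] = HexDigits[b & 0x0f];
    }

    // Returns the whole backing array, regardless of how much was written.
    public byte[] GetInternalBuffer() {
        return buffer;
    }
}

static class EncodeModel {
    // Mirrors the hexWriting branch of the quoted EncodeBytes.
    public static byte[] EncodeHex(byte[] bytes) {
        MiniByteBuffer buf = new MiniByteBuffer(bytes.Length * 2);
        foreach (byte b in bytes) {
            buf.AppendHex(b);
        }
        return buf.GetInternalBuffer();
    }
}

static class RootCauseDemo {
    static void Main() {
        Console.WriteLine(EncodeModel.EncodeHex(new byte[0]).Length);         // 128
        Console.WriteLine(EncodeModel.EncodeHex(new byte[] { 0xAB }).Length); // 2
    }
}
```

For non-empty input the requested size is exactly what gets written, so returning the internal buffer is harmless; only the zero-length case hits the 128-byte fallback.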

Suggested fix

Adding the guard if (bytes.Length == 0) return bytes; at the start of EncodeBytes would be enough.
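A sketch of how the guarded hex branch might behave, under stated assumptions: EncodeFixSketch is a hypothetical standalone class, and it uses a plain array instead of iText's ByteBuffer so that it runs on its own. It has not been tested against the library.

```csharp
using System;
using System.Text;

static class EncodeFixSketch {
    private static readonly byte[] HexDigits =
        Encoding.ASCII.GetBytes("0123456789abcdef");

    // Hypothetical model of the hexWriting branch with the suggested guard.
    public static byte[] EncodeHex(byte[] bytes) {
        if (bytes.Length == 0) {
            return bytes; // suggested guard: nothing to encode
        }
        // Two hex digits per input byte; a plain array replaces ByteBuffer here
        byte[] result = new byte[bytes.Length * 2];
        int pos = 0;
        foreach (byte b in bytes) {
            result[pos++] = HexDigits[(b >> 4) & 0x0f];
            result[pos++] = HexDigits[b & 0x0f];
        }
        return result;
    }

    static void Main() {
        Console.WriteLine(EncodeHex(new byte[0]).Length); // 0
        Console.WriteLine(Encoding.ASCII.GetString(
            EncodeHex(new byte[] { 0xAB, 0x01 })));       // ab01
    }
}
```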

Unfortunately, no workaround is possible without modifying itext7 core.

I'm not posting this as a PR because my PRs at https://github.com/itext/i7n-pdfocr/pulls haven't received any attention since 2020. Hopefully, shaped as a bug report, it will get a bit more attention from the itext7 team.
