pdftk generated pdf does not render correct utf-8

The pdftk version is 2.02.

I have a simple PDF with a name field. I created an FDF using pdftk:

pdftk form4.pdf generate_fdf output data4.fdf

Removing the unnecessary fields, here is what it has: (full view on pastebin)

/Fields [
<<
/V (testname)
/T (name)
>>]

I then modify the FDF and change testname to testϢname (U+03E2). I used vim with UTF-8 encoding on, and also checked with cat that it displays correctly in the terminal.

I then attempt to generate a pdf using

pdftk form4.pdf fill_form data4.fdf output form5.pdf need_appearances

Viewing the form, I see different characters. I viewed this in macOS Preview, Chrome, and Acrobat. I also used flatten but the result was the same. This is an Arial Unicode character.

There are 1 answers

K J On

Calling a character "Arial" does not make it render as Arial unless that font is "inside" the PDF. Since you cannot know whether the characters a client types will be American, Armenian, Asian or anything else, you would have to embed fonts covering every script in the world. So form fields usually carry no real glyphs at first; they start out effectively "fontless". It is on save that a PDF re-writer such as Acrobat Reader (a trimmed-down Acrobat editor) adds the needed fonts to the file.

Your form had 20 object entries. We can trim that count by about 25%, to 15 active objects, losing nothing in size but making the file cleaner. Here are the key entries as they were. Object 6 0, of 20, is the page content. /TLBZsrqIpt is the field XObject, positioned relative to the bottom left, followed by the font /FTPLWNbykz used to draw the plain text (name) at 14 pt. The parentheses mean a literal string of plain ASCII bytes.

6 0 obj
<</Length ~90>>
stream
q /TLBZsrqIpt Do Q
q 0 g BT /FTPLWNbykz 14 Tf 1 0 0 1 199 689.154 Tm [(name)] TJ ET Q
endstream
endobj

First, let's be clear that the fixed page text is NOT Arial. It needs no embedded font because it uses stock Helvetica, so only "Swiss style" characters 32-128, and a few more, are allowed.

Here the encoding is single-byte "Windows ANSI" (WinAnsiEncoding), though it could equally have been a Mac encoding (MacRomanEncoding).

/Font<</FTPLWNbykz 4 0 R>>

4 0 obj <</Type/Font/Subtype/Type1/BaseFont/Helvetica/Encoding/WinAnsiEncoding>> endobj

So what about the field placed before it? What font does that embed?

2 0 obj <</AP<</N 16 0 R>>/DA(/Helv 14 Tf 0 g)/F 4/FT/Tx/Ff 4194304/MK<<>>/P 1 0 R/Q 0/Rect[200.5 650 300.5 675]/Subtype/Widget/T(name)/TM(NySwFduhG)/V(testname)>> endobj

So the field also uses a 14 point font given in abbreviated form: /Helv 14, a Helvetica "Swiss style" Latin single-byte font. It also only expects its value to be replaced with (plain text).

Therefore nothing in the file so far allows for any "UTF-16" character encoding.
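
A small Python sketch of why the edited FDF probably shows "different characters". It assumes the viewer reads each byte of the (literal) string as one single-byte character; Latin-1 is used here as a stand-in for that single-byte reading, and the function names are only illustrative.

```python
def fits_winansi(text: str) -> bool:
    """Return True if every character exists in Windows-1252 (WinAnsi)."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def misread_utf8_as_single_bytes(text: str) -> str:
    """Show what UTF-8 bytes look like when each byte is taken as one character."""
    # Assumption: Latin-1 approximates the viewer's single-byte interpretation.
    return text.encode("utf-8").decode("latin-1")


value = "test\u03e2name"  # testϢname, as typed into the FDF with vim
print(fits_winansi(value))                  # U+03E2 has no WinAnsi code
print(misread_utf8_as_single_bytes(value))  # the two UTF-8 bytes CF A2 become two characters
```

U+03E2 is stored in UTF-8 as the two bytes CF A2, so a single-byte reading yields two unrelated Latin characters in place of the one intended glyph.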

Once Acrobat Reader, in its "editing" role, saves the file, it changes. There are no longer 15 working objects but about 24, in a file of 33 (about 8 are redundant entries). The file is also roughly 8 times bigger, because the font data for that one character has to be embedded. You won't see the font listed here, as it is not part of the simple "Helvetica" name hinted at by the page text.

So how is Acrobat changing from plain text to a "font based" field entry?

Here it is. The (ANSI) string has been changed programmatically to 16-bit hex, i.e. /V<FEFF007400650073007403E2006E0061006D0065>.

2 0 obj <</AP<</N 21 0 R>>/DA(/Helv 14 Tf 0 g)/F 4/FT/Tx/Ff 4194304/MK<<>>/P 1 0 R/Q 0/Rect[200.5 650 300.5 675]/Subtype/Widget/T(name)/TM(NySwFduhG)/V<FEFF007400650073007403E2006E0061006D0065>>> endobj
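
To check what such a hex value holds, it can be decoded back to text. This is a minimal sketch: `decode_pdf_hex_string` is a hypothetical helper name, and the fallback branch uses Latin-1 only as an approximation of a single-byte PDF string.

```python
def decode_pdf_hex_string(hex_body: str) -> str:
    """Decode the hex digits found between < and > in a /V entry."""
    data = bytes.fromhex(hex_body)
    if data[:2] == b"\xfe\xff":
        # FEFF is the byte order mark announcing UTF-16 big-endian text.
        return data[2:].decode("utf-16-be")
    # Assumption: without the mark, treat the bytes as single-byte text.
    return data.decode("latin-1")


print(decode_pdf_hex_string("FEFF007400650073007403E2006E0061006D0065"))
```

Each character takes four hex digits, so 03E2 in the middle of the string is the single code point U+03E2.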

Answer

The FDF needs to carry the value as the hex-ASCII equivalent of UTF-16 (starting with the FEFF byte order mark). An FDF of version 1.4 will do that:

%FDF-1.4
%âãÏÓ
1 0 obj
<</FDF <</F (form4.pdf)/Fields [<</T (name)/V<FEFF007400650073007403E2006E0061006D0065>
>>]/ID [<476478A7B9CD10F87E102A53C18E662F><6F267D21B03A414390674770CB51E71C>]/UF (form4.pdf)>>/Type /Catalog>>
endobj
trailer
<</Root 1 0 R>>
%%EOF
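
The hex value does not have to be typed by hand. The following Python sketch produces it and wraps it in a minimal FDF modelled on the one above. The helper names are hypothetical, the /F, /UF and /ID entries are left out on the assumption that they are optional, and field names are assumed to be plain ASCII without parentheses or backslashes.

```python
def to_pdf_hex_string(text: str) -> str:
    """Encode text as a PDF hex string: FEFF mark plus UTF-16 big-endian."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def build_fdf(fields: dict) -> str:
    """Build a minimal FDF body from a mapping of field name to value."""
    # Assumption: field names are simple ASCII, so no escaping is done here.
    entries = "".join(
        f"<</T ({name})/V {to_pdf_hex_string(value)}>>\n"
        for name, value in fields.items()
    )
    return (
        "%FDF-1.4\n"
        "1 0 obj\n"
        "<</FDF <</Fields [\n" + entries + "]>>>>\n"
        "endobj\n"
        "trailer\n"
        "<</Root 1 0 R>>\n"
        "%%EOF\n"
    )


print(to_pdf_hex_string("test\u03e2name"))
print(build_fdf({"name": "test\u03e2name"}))
```

The output is pure ASCII, so it can be saved as data4.fdf with any encoding and passed to the fill_form command from the question. Whether a given viewer then shows the glyph still depends on a suitable font being available.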

We can click the file and Acrobat accepts it as a valid substitution.

However, without a related embedded font there is no assurance it will work on all devices and in all languages. Acrobat bloats the compressed 7 KB file to 26 KB to store that single character, so it should work via the Acrobat FDF tools.

In comments it was pointed out that the older FDF format that pdftk uses is not compatible with modern Acrobat versions, which are baselined on %FDF-1.4 and later.

So using the pdftk methods we can generate a field value for those characters, which Acrobat will in turn display using local fonts. One issue I foresee is that, when the field is built, the field generator needs to flag it as a "Rich Text" field.

However, the field must NOT be flattened until after Acrobat Reader has saved the file. Otherwise the supporting font(s) that Acrobat loads will be missing, and at best the character will not display as intended.

Once those characters are embedded in the file, they can be used before flattening. P.S. You do not need to flatten fields for readers except to lock the data, and even then it will still be editable. A reader does not care whether the field is flat or not, only that the Coptic shapes are embedded as a font.