Is there a way to remove fonts that are embedded multiple times from a PDF file?
This is my scenario:
1) a program generates several one-page pdf reports (querying a db, putting the info on an excel template and exporting the formatted information in pdf)
2) pdftk merges the single-page pdfs in one file.
Everything works fine, but the size of the resulting pdf is very large: in fact, I noticed that the fonts are embedded multiple times (as many times as there are pages: all pages are generated starting from the same excel template, the fonts are embedded in each single-page pdf file and pdftk just glues the pdfs together). Is there a way to keep just one copy of each embedded font?
I tried to embed the fonts just in the first page while exporting from excel->pdf: the size of the file decreases dramatically, but it seems that the other pages can't access the embedded fonts.
Thanks, Alessandro
You could try to 'repair' your pdftk-concatenated PDF using Ghostscript (but use a recent version, such as 9.05). In many cases Ghostscript will be able to merge the many subsetted fonts into fewer ones.
The command would look like this:
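A minimal sketch of such a call, wrapped in Python so the argument list can be inspected before anything runs. The file names are placeholders, the /prepress preset is an assumption (another -dPDFSETTINGS preset may suit you better), and build_repair_command and repair are hypothetical helper names, not part of Ghostscript:

```python
import shutil
import subprocess


def build_repair_command(src, dst, gs="gswin32c.exe"):
    """Return the Ghostscript argument list that rewrites src into dst."""
    return [
        gs,                         # use "gs" on Linux/macOS
        "-o", dst,                  # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",        # regenerate the PDF through the pdfwrite device
        "-dPDFSETTINGS=/prepress",  # assumed preset; note -d, not -s
        src,
    ]


def repair(src, dst, gs="gswin32c.exe"):
    """Run Ghostscript on a pdftk-merged file (requires Ghostscript on PATH)."""
    if shutil.which(gs) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs}")
    subprocess.run(build_repair_command(src, dst, gs), check=True)
```

Calling repair("input.pdf", "output.pdf") would then write the re-generated file next to the original, leaving the pdftk output untouched for comparison.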
Check with pdffonts.exe how many instances of the various font subsets are in each file (pdffonts.exe is available as part of a small package of command-line tools).
But don't complain about the 'slow speed' of this process: Ghostscript has to interpret all PDF input files completely to accomplish its task, while the pdftk file concatenation is a much simpler process...
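To compare the font counts before and after, the pdffonts listing can be summarized per base font. The sketch below assumes the usual pdffonts table layout (a header line, a dashed separator, then one row per font with the name in the first column) and the standard six-letter subset prefix such as ABCDEF+Arial; count_font_subsets is a hypothetical helper name:

```python
import re
from collections import Counter

# Subsetted fonts carry a tag of six uppercase letters plus "+", e.g. "ABCDEF+Arial".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def count_font_subsets(pdffonts_output):
    """Count how many subset instances of each base font a pdffonts listing shows."""
    counts = Counter()
    in_table = False
    for line in pdffonts_output.splitlines():
        if line.startswith("----"):
            in_table = True          # font rows start after the dashed separator
            continue
        if not in_table or not line.strip():
            continue
        name = line.split()[0]       # first column is the font name
        if SUBSET_PREFIX.match(name):
            counts[SUBSET_PREFIX.sub("", name)] += 1
    return counts
```

Feed it the text printed by pdffonts.exe for the pdftk-merged file and for the Ghostscript output; a base font that shows up dozens of times in the first listing and only once or twice in the second indicates that the subsets were merged.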
Update:
Instead of pdftk you could use Ghostscript to merge your input PDF files. This could possibly avoid the problem you were seeing with the a posteriori Ghostscript 'repair' of your pdftk-merged files. Note that this will be much slower than the 'dumb' pdftk merge. However, the results may please you better, especially regarding font handling and file size.
This would be a possible command:
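As a rough sketch of a Ghostscript-only merge, the helper below builds the argument list for any number of input files. The file names are placeholders, the /prepress preset is an assumption, and build_merge_command is a hypothetical name introduced for illustration:

```python
import subprocess


def build_merge_command(inputs, dst, gs="gswin32c.exe"):
    """Return the Ghostscript argument list that merges inputs into dst, in order."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return [
        gs,                         # use "gs" on Linux/macOS
        "-o", dst,                  # merged output file
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",  # assumed preset
        *inputs,                    # pages appear in the order given here
    ]


def merge(inputs, dst, gs="gswin32c.exe"):
    """Run the merge; raises CalledProcessError if Ghostscript reports failure."""
    subprocess.run(build_merge_command(inputs, dst, gs), check=True)
```

For example, merge(["input.1.pdf", "input.2.pdf"], "merged.pdf") would process both files in a single Ghostscript run, which is what gives it the chance to consolidate the font subsets.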
You can add more options to the Ghostscript CLI for more fine-tuned control over the merge and optimization process.
In the end you'll have to decide between the extremes: pdftk producing large output files vs. gswin32c.exe (Ghostscript) producing lean output files.
I'd be interested if you would post some results (execution time and resulting file sizes) for both methods for a number of your merge processes...
Update 2: Sorry, my previous version contained a typo. It's not -sPDFSETTINGS=... but it must be -dPDFSETTINGS=... (d in place of s).
Update 3:
Since your source files are Excel sheets made from templates (which usually don't use a lot of different fonts), you could try a trick to make sure Ghostscript has all the required glyphs of the fonts used in all the PDFs that will be merged later: add a cell to each template that contains every character likely to occur, such as 0123456789, ABCD...XYZ, abc...xyz, ,:-_;°%&$§")({}[] etc.
This method will hopefully make sure that each of your PDFs uses the same subset of glyphs, which would then avoid the problems you observed when merging the files with Ghostscript. (Note that if you use, for example, Arial and Arial-Italic, you have to create two such cells: one formatted with the standard Arial typeface, the other one with the italic one.)
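The text for such a cell can be generated rather than typed by hand. This is only a sketch: coverage_text is a hypothetical helper, and the default extra characters are simply the ones listed above, so extend them with whatever else your reports contain (accented letters, currency signs and so on):

```python
import string


def coverage_text(extra=',:-_;°%&$§")({}[]'):
    """Return one string holding digits, A-Z, a-z and the given extra characters."""
    return (
        string.digits
        + string.ascii_uppercase
        + string.ascii_lowercase
        + extra
    )
```

Paste the result of coverage_text() into the cell once per typeface used in the template.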