Using sed to fix PDF files


I am running GNU sed version 4.2.1 on Windows. I have a huge number of PDF files whose last record consists of %%EOF, a newline (CR LF), and a large number of NUL characters.

See hexdump below.

0000b890: 25 25 45 4F 46 0D 0A 00 - 00 00 00 00 00 00 00 00 |%%EOF...........|
0000b8a0: 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 |................|

I need to change the last record so that it contains only %%EOF. The expression ^%%EOF\x0d\x0a\x0{10,30000} matches these characters in Notepad++, but it does not seem to work in sed. Can anyone help? Many thanks.


There is 1 answer

Answer by Stefan Hegny

Assuming your grep supports the options used below, do the following for a given input.pdf.

Read the byte offset of the last %%EOF in the file into the variable offset

offset=$( grep -a -b -o '%%EOF' input.pdf | tail -n 1 | cut -d: -f1 )

Then keep only the first offset + 5 bytes of the original file (5 is the length of the string %%EOF) and drop everything after them; the resulting output.pdf should be what you wanted

head -c$(( $offset + 5 )) input.pdf > output.pdf
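Since the question mentions a huge number of files, the two steps can be wrapped in a small function. This is only a sketch: the name trim_after_eof is made up for illustration, and it assumes GNU grep (for -a, -b and -o) and a head that accepts -c. It refuses to write anything when no %%EOF is found.

```shell
# Sketch only: trim_after_eof is a hypothetical helper name, not part of any tool.
# Usage: trim_after_eof INPUT OUTPUT   (INPUT and OUTPUT must be different files)
trim_after_eof() {
    src=$1
    dst=$2
    # -a: treat binary data as text; -b with -o: print the byte offset of each match (GNU grep).
    offset=$(LC_ALL=C grep -a -b -o '%%EOF' "$src" | tail -n 1 | cut -d: -f1)
    if [ -z "$offset" ]; then
        echo "no %%EOF marker in $src, nothing written" >&2
        return 1
    fi
    # Keep offset + 5 bytes: everything before the last marker plus the 5 bytes of %%EOF itself.
    head -c "$((offset + 5))" "$src" > "$dst"
}
```

It could then be called in a loop, for example for f in *.pdf; do trim_after_eof "$f" "fixed/$f"; done, where fixed/ is an assumed, already existing output directory. Writing to a separate file keeps the originals intact in case a file turns out not to match the expected layout.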

But depending on the nature of the PDF (e.g. no %%EOF at all at the end, or data other than null bytes following the %%EOF; thanks @mkl), this might behave differently from what you want or cause other problems.
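One way to guard against the second case is to check, before truncating, that everything after the last %%EOF really is just padding. The following is a sketch under the same assumptions (GNU grep, plus tail and tr); only_padding_after_eof is a hypothetical name, and treating CR, LF and NUL as the only harmless trailing bytes is an assumption based on the hexdump in the question.

```shell
# Sketch only: only_padding_after_eof is a hypothetical helper name.
# Exit status 0: every byte after the last %%EOF is CR, LF or NUL.
# Exit status 1: other data follows the marker. Exit status 2: no %%EOF found.
only_padding_after_eof() {
    src=$1
    offset=$(LC_ALL=C grep -a -b -o '%%EOF' "$src" | tail -n 1 | cut -d: -f1)
    [ -n "$offset" ] || return 2
    # tail -c +N starts at byte N (1-based), so offset + 6 is the first byte after the marker.
    extra=$(tail -c "+$((offset + 6))" "$src" | LC_ALL=C tr -d '\000\r\n' | wc -c)
    # Any byte that survived the tr filter is something other than padding.
    [ "$((extra))" -eq 0 ]
}
```

A file for which this check fails is probably better inspected by hand than truncated automatically.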