Extract Arabic-numbered pages with qpdf

The file srcfile.pdf starts with a variable number of Roman-numbered pages (i, ii, iii, etc.), followed by Arabic-numbered pages (1, 2, 3, ..., n).

How can I extract only the Arabic-numbered pages (e.g. #1 to #10)?

The following command counts from the first page of the file, so it extracts pages i, ii, iii, 1, 2, etc.

qpdf --empty --pages srcfile.pdf 1-10 -- targetfile.pdf

Is it possible to extract only pages 1, 2, 3, etc.?

1 answer

Answer by wolfrevo:

qpdf has a --json option that writes a JSON representation of the file, including its pagelabels (the ranges of pages that share one numbering style).

This makes a workaround possible with a JSON parser such as jq.

The following bash script converts a "relative" page, i.e. a page number within one pagelabel, into an absolute page number:

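cat <<'EOF' | tee relpage.sh
#!/bin/bash
# Usage: relpage.sh FILE LABEL:PAGE
# Prints the absolute page number of PAGE within pagelabel LABEL.
PDFFILE=$1
STR=$2
PAGELABELNR=${STR%:*}      # part before the colon: 0-based pagelabel number
PAGENRINLABEL=${STR#*:}    # part after the colon: page number within that pagelabel
# 0-based index of the first page of the selected pagelabel
PAGELABELINDEX=$(qpdf --json "$PDFFILE" | jq -r ".pagelabels[$PAGELABELNR].index")
# --pages counts from 1, so index + page number is the absolute page
# (assuming the pagelabel starts numbering at 1)
ABSPAGENR=$((PAGELABELINDEX + PAGENRINLABEL))
echo "$ABSPAGENR"
EOF
chmod +x relpage.sh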

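Usage: ./relpage.sh inputfile.pdf 1:17 prints the absolute page number of page 17 in pagelabel 1. Note that pagelabels are numbered from 0, so in a file with Roman-numbered front matter the Arabic-numbered pages are usually pagelabel 1.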
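
The arithmetic inside the script can be checked by hand. The following is a minimal sketch, not part of the original answer: abs_page is a hypothetical helper name, and it assumes the pagelabel starts numbering at 1. qpdf reports the index as a 0-based page position while --pages counts from 1, so adding the page number to the index lands on the right absolute page.

```shell
# abs_page INDEX N
# Prints the absolute (1-based) page number of the page labelled N in a
# pagelabel whose first page has the 0-based index INDEX.
# Assumption: the pagelabel starts numbering at 1.
abs_page() {
    local index=$1 n=$2
    echo $(( index + n ))
}

# Example: four Roman-numbered pages come first, so pagelabel 1 has index 4.
abs_page 4 1     # prints 5  (first Arabic-numbered page)
abs_page 4 17    # prints 21
```

With these numbers, page 17 of pagelabel 1 is absolute page 21, which is the value relpage.sh would pass to qpdf.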

To extract pages 17 to 39 of pagelabel 1, use the following command:

qpdf \
    --empty \
    --pages \
    "${INPUTFILE}" \
    "$(./relpage.sh "${INPUTFILE}" 1:17)-$(./relpage.sh "${INPUTFILE}" 1:39)" \
    -- \
    output.pdf

To inspect the pagelabels, run qpdf --json --json-key=pagelabels inputfile.pdf, or use the following loop:

$ INPUTFILE=inputfile.pdf
$ PAGELABELSLENGTH=$(qpdf --json "${INPUTFILE}" | jq -r '.pagelabels | length')
$ echo "file '${INPUTFILE}' has ${PAGELABELSLENGTH} pagelabels"
$ for i in $(seq 0 $((PAGELABELSLENGTH-1))); \
  do echo "pagelabel #$i starts at index #$(qpdf --json "${INPUTFILE}" | jq -r ".pagelabels[$i].index")"; \
  done

file 'inputfile.pdf' has 2 pagelabels
pagelabel #0 starts at index #0
pagelabel #1 starts at index #4
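
The loop above runs qpdf once per pagelabel. As a possible variation, the same listing can be produced from a single qpdf call by letting jq iterate over the array. This is only a sketch: label_ranges is a hypothetical helper name, it needs jq on the PATH, and it assumes the JSON has a top-level pagelabels array whose entries carry an index field, as in the output shown above.

```shell
# label_ranges: read qpdf's JSON on stdin and print one line per pagelabel.
# (label_ranges is a hypothetical helper name; it needs jq on the PATH.)
label_ranges() {
    jq -r '.pagelabels
           | keys[] as $i
           | "pagelabel #\($i) starts at index #\(.[$i].index)"'
}

# Usage, assuming qpdf is installed and inputfile.pdf exists:
#   qpdf --json --json-key=pagelabels inputfile.pdf | label_ranges
```

Restricting the output with --json-key=pagelabels keeps the JSON small, which matters for large files because plain --json dumps every object in the PDF.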