Saving HTML email with inline images to PS or PDF with Procmail


I need a little advice/push in the right direction.

I have written some small scripts that take an incoming HTML email, convert it to PostScript and then send it to a designated printer via CUPS. The printer is chosen based on the recipient of the email.

I am using the following to achieve this:

  1. Exim
  2. Procmail
  3. HTML2PS
  4. Two custom scripts (posted below)

The flow

  1. An email is received by Exim and passed to Procmail
  2. .procmailrc calls the custom script "process_mail", passing the saved message file and the subject as parameters
  3. "process_mail" hands the file to a function that pipes it through "get_html_from_message" (I am not doing anything with the subject yet)
  4. "get_html_from_message" discards everything except the HTML
  5. HTML is then converted to PostScript
  6. PostScript file is sent to designated printer.

Problems

  1. At the HTML2PS stage an error is generated and an NDR is sent back to the sender, stating that the images could not be opened (for example: Error opening cid:logo.jpg).
  2. PostScript file is successfully printed but obviously does not contain the images from the email.

My question is: How do I get those images out of the email so that they will print out successfully in the PostScript file?

I am more than happy to convert to PDF if PostScript is not suitable, but even converting to PDF leaves me without the images because I cannot get at them.

.procmailrc

SHELL=/bin/bash

# Extract the subject and normalise
SUBJECT=`formail -x"Subject: "\
| /usr/bin/tr '[:space:][:cntrl:][:punct:]' '_' | expand | sed -e     's/^[_]*//' -e 's/[_]*$//'`
YMD=`date +%Y%m%d`

MAKE_SURE_DIRS_EXIST=`
mkdir -p received_mail/backup
if [ -n "${SUBJECT}" ]
then
    mkdir -p received_mail/${YMD}/${SUBJECT}
else
    mkdir -p received_mail/${YMD}/no_subject
fi
`

# Backup all received mail into the backup directory appending to a file named by date
:0c
received_mail/backup/${YMD}.m

# If no subject, just store the mail
:0c
* SUBJECT ?? ^^^^
received_mail/${YMD}/no_subject/.

# Else there is a subject, generate a unique filename, place the received email
# in that file and then execute process_mail passing the filename and subject as parameters
:0Eb
| f=`uuidgen`; export f; cat > received_mail/${YMD}/${SUBJECT}/${f};     $HOME/bin/process_mail received_mail/${YMD}/${SUBJECT}/${f} "${SUBJECT}"

# and don't deliver to standard mail, don't want to clutter up the inbox. 
:0
/dev/null

process_mail

#!/bin/bash

# Test Printer
printer=$(whoami)

file=$1
subject=$2

function process_rrs {
typeset file
file=$1
cat $file \
| $HOME/bin/get_html_from_message \
| html2ps \
| lp -d ${printer} -o media=a4 2>&1
}

case "$subject" in
*)
    process_rrs $file
    ;;
esac

get_html_from_message

cat | awk '
BEGIN {
typeout=0
}
{
if($0 ~ /<html/)
    typeout=1
if($0 ~ /^------=/)
    typeout=0
if(typeout)
    print $0
}'

EDIT: Formatting

There are 2 answers

tripleee On

The problem is probably an incomplete understanding of how HTML is represented in email. There will typically be a MIME multipart with one HTML part and multiple images. The HTML uses the cid: addressing scheme in image links to refer to these sibling parts. But if you extract just the HTML, it no longer exists in a context where it has any siblings. (Even if you extract all parts to files, cid: does not normally map to a local file. Maybe you could post-process the HTML to fix that; but I'm thinking maybe your approach should be rethought. Have you considered using a mail client with native HTML support for rendering these messages?)
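To make the sibling relationship concrete, here is a minimal sketch that lists the Content-ID headers of a saved message next to the cid: targets used in its HTML. The function names and the sample message are made up for illustration. It assumes the headers are not folded across lines and that the HTML part is not base64-encoded.

```shell
#!/bin/bash
# Sketch only: compare Content-ID headers with the cid: links in the HTML.
# list_content_ids and list_cid_refs are hypothetical helper names.

# Print every Content-ID value without its angle brackets.
list_content_ids() {
    sed -n 's/^[Cc]ontent-[Ii][Dd]:[[:space:]]*<\([^>]*\)>.*/\1/p' "$1"
}

# Print every distinct cid: target that appears in the message text.
list_cid_refs() {
    grep -o "cid:[^\"' >]*" "$1" | sed 's/^cid://' | sort -u
}

# A made-up multipart/related message with one HTML part and one image part.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Content-Type: multipart/related; boundary="b1"

--b1
Content-Type: text/html

<html><body><img src="cid:logo.jpg"></body></html>
--b1
Content-Type: image/jpeg
Content-ID: <logo.jpg>
Content-Transfer-Encoding: base64

/9j/4AAQ
--b1--
EOF

echo "Content-IDs:";    list_content_ids "$sample"
echo "cid references:"; list_cid_refs "$sample"
rm -f "$sample"
```

Both lists show logo.jpg for this sample. Once only the HTML part is piped on to the converter, the part carrying that Content-ID is gone, so the cid: link has nothing to resolve against.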

A simple xmlstarlet script or similar to strip the cid: prefix from the src attribute of any img link should not be hard to write, but you will probably discover additional things you need to do if you attempt this path.
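As a rough illustration of that post-processing step, the sketch below uses sed rather than xmlstarlet, since HTML in email is often not well-formed XML. The function name is hypothetical. It assumes each image was saved under a file name equal to its Content-ID, that the src value is quoted, and that the attribute is written in lowercase on a single line.

```shell
#!/bin/bash
# Sketch only: turn src="cid:NAME" into src="NAME" so that a converter
# run from the directory holding the extracted images can find them.
# strip_cid_prefix is a hypothetical helper name.
strip_cid_prefix() {
    # First expression handles double-quoted values, second single-quoted.
    sed -e 's/\(src="\)cid:/\1/g' -e "s/\(src='\)cid:/\1/g"
}

# Example: prints <img src="logo.jpg">
printf '%s\n' '<img src="cid:logo.jpg">' | strip_cid_prefix
```

A regex like this will miss unquoted attributes and content that is still quoted-printable encoded (src=3D"cid:..."), which is the kind of additional work mentioned above.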

Soddengecko On

I have figured out how to achieve this. The details are below. All of this is running on two load balanced CentOS 6 boxes.

Applications

  1. Exim
  2. CUPS
  3. Mhonarc (not in the repos; the RPM and website are at https://www.mhonarc.org/)
  4. Procmail
  5. Html2ps

How it works

  1. An email is sent to a user account that exists on both boxes.
  2. Exim pipes that email to Procmail
  3. Procmail looks for .procmailrc in the user's home directory and takes action.
  4. Mhonarc converts the email to an HTML file, saving the images and attachments.
  5. Using "sed", open the HTML file and look for the start of the email (the <!--X-Body-of-Message--> marker) and collect all text to the end of file
  6. Pipe to "sed" again to remove superfluous HTML tags (hr tags) added by Mhonarc
  7. Pipe to Html2ps to convert to PostScript
  8. Pipe to designated printer (Printers are named the same as the user account)

Using the above process I was able to drop it down to just one script, .procmailrc. This is what I have written in the .procmailrc file.

SHELL=/bin/bash

# Designate the printer. Printer names match usernames so you don't have to manually change 60+ files. 
printer=`whoami`

# Generate a unique ID
f=`uuidgen`

# Convert email, including headers and body, into an HTML file and save off the images using MHONARC https://www.mhonarc.org/
# Open file and search <!--X-Body-of-Message--> string using SED and collect all text to EOF.
# Pipe the result into SED again to remove unwanted HTML tags added by MHONARC
# Pipe result into HTML2PS to convert to PostScript
# Pipe PostScript file to the designated printer
:0E
| mhonarc -single > ${f}.html; sed -n '/^<!--X-Body-of-Message-->$/ { s///; :a; n; p; ba; }' ${f}.html | sed -e '/<hr>/d' | html2ps | lp -d ${printer} -o media=a4 2>&1

# Finally, delete the email
:0
/dev/null

I do not know "sed" very well, and there may well be an easier way of achieving this. I will investigate further at some point.
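One possibly simpler form of the two sed steps is sketched below: delete everything up to and including the marker line, then drop the hr lines, in a single sed call. The function name is hypothetical, and it assumes the marker is never the very first line of the file, because a range starting at line 1 only begins looking for the marker from line 2 onwards.

```shell
#!/bin/bash
# Sketch only: print what follows the <!--X-Body-of-Message--> marker,
# minus any line containing <hr>. body_after_marker is a hypothetical name.
body_after_marker() {
    sed -e '1,/^<!--X-Body-of-Message-->$/d' -e '/<hr>/d' "$1"
}

# Made-up stand-in for the HTML file written by mhonarc.
page=$(mktemp)
cat > "$page" <<'EOF'
<html>
<!--X-Body-of-Message-->
<p>hello</p>
<hr>
<p>bye</p>
EOF

# Prints the two <p> lines only.
body_after_marker "$page"
rm -f "$page"
```

In the recipe this would stand in for the part between mhonarc and html2ps. If the marker is missing, nothing is printed, which matches the behaviour of the original loop.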

Hope this helps someone :)