POSIX regex for parsing an email message: how to stop after the first occurrence of the boundary


I am working on a parser for email messages in MIME format. I am forced to use the POSIX regex library written in C, and I am puzzled by its behaviour.

Suppose we have the following part of an email message:

--------------010402010107070509040804
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit

plain message

--------------010402010107070509040804
Content-Type: text/html; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit

html message

--------------010402010107070509040804--

Now I need to extract the different parts of the message (plain and HTML). I used the following pattern to get the data between boundaries:

^((.|\\s)+?)--------------010402010107070509040804

This pattern works well in some regex libraries. For example, when I ran the same regex in JavaScript, I was able to get those two parts of the message without any problem.

However, the POSIX regex library returns the whole message, excluding the "--" at the end. This is its result:

--------------010402010107070509040804
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit

plain message

--------------010402010107070509040804
Content-Type: text/html; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit

html message

--------------010402010107070509040804

Why did it not stop after finding the first occurrence of the boundary after the plain message? Am I missing something?

There is 1 answer

Answered by asontu

POSIX regexes don't have non-greedy (lazy) quantifiers, so the match always runs as far as it can. There is a way to do it, but it's ugly and long. To simplify, say the token were much shorter, like --123; you'd need this regex:

^(([^-]|-[^-]|--[^1]|--1[^2]|--12[^3])+)

That's already insanely long for something so simple. Basically, you're telling the regex that you want a repetition of anything that isn't -, or a - followed by anything that isn't -, or -- followed by anything that isn't 1, and so on.
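To see the difference between the two patterns, here is a minimal C sketch using regcomp/regexec from <regex.h>. The helper name group1_len and the sample text are made up for illustration; the greedy pattern ^(.+)--123 stands in for the pattern from the question.

```c
#include <regex.h>

/* Hypothetical helper: length of capture group 1 when `pattern`
 * (a POSIX extended regex) matches `text`, or -1 on a compile
 * error or when there is no match. */
static long group1_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[2];
    long len = -1;

    /* REG_NEWLINE is deliberately not set, so '.' and [^...]
     * also match line breaks. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 2, m, 0) == 0)
        len = (long)(m[1].rm_eo - m[1].rm_so);
    regfree(&re);
    return len;
}
```

With the text "plain\n--123\nhtml\n--123--", the greedy pattern ^(.+)--123 captures 17 characters (everything up to the last --123), while the alternation pattern above captures 6 (just "plain\n"). Because of the ^ anchor, the search has to start right after the previous boundary; if the text begins with the token itself, nothing matches. One caveat: each alternative consumes the character after a partial match, so a stray - directly in front of the token (as in a---123) lets the match run past it. A MIME boundary follows a line break, so that should not come up here.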

I made a script to produce a regex from an input token, ran it with --------------010402010107070509040804, and it gave me this:

^(([^-]|-[^-]|--[^-]|---[^-]|----[^-]|-----[^-]|------[^-]|-------[^-]|--------[^-]|---------[^-]|----------[^-]|-----------[^-]|------------[^-]|-------------[^-]|--------------[^0]|--------------0[^1]|--------------01[^0]|--------------010[^4]|--------------0104[^0]|--------------01040[^2]|--------------010402[^0]|--------------0104020[^1]|--------------01040201[^0]|--------------010402010[^1]|--------------0104020101[^0]|--------------01040201010[^7]|--------------010402010107[^0]|--------------0104020101070[^7]|--------------01040201010707[^0]|--------------010402010107070[^5]|--------------0104020101070705[^0]|--------------01040201010707050[^9]|--------------010402010107070509[^0]|--------------0104020101070705090[^4]|--------------01040201010707050904[^0]|--------------010402010107070509040[^8]|--------------0104020101070705090408[^0]|--------------01040201010707050904080[^4])+)

A beast, but the best POSIX can do as far as I know :P
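The script itself isn't shown here, but a generator along those lines could look like the following C sketch. The names build_regex and put_literal are hypothetical. It assumes a non-empty token and backslash-escapes the characters that are special in POSIX extended regexes, since a MIME boundary may contain things like ( ) + . and ?.

```c
#include <stdlib.h>
#include <string.h>

/* Append one token character, escaping POSIX ERE special characters. */
static char *put_literal(char *out, char c)
{
    if (strchr("^.[$()|*+?{\\", c) != NULL)
        *out++ = '\\';
    *out++ = c;
    return out;
}

/* Hypothetical generator: builds "^((alt|alt|...)+)" for `token`,
 * where alternative i is the first i token characters followed by
 * "anything but character i". Returns a malloc'ed string that the
 * caller must free, or NULL on an empty token or allocation failure. */
static char *build_regex(const char *token)
{
    size_t n = strlen(token);
    /* Each alternative needs at most 2*n prefix bytes plus "[^c]|". */
    size_t cap = n * (2 * n + 5) + 8;
    char *buf, *p;

    if (n == 0 || (buf = malloc(cap)) == NULL)
        return NULL;
    p = buf;
    memcpy(p, "^((", 3);
    p += 3;
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            *p++ = '|';
        for (size_t j = 0; j < i; j++)
            p = put_literal(p, token[j]);
        /* Inside a bracket expression every single character is literal. */
        *p++ = '[';
        *p++ = '^';
        *p++ = token[i];
        *p++ = ']';
    }
    memcpy(p, ")+)", 4); /* also copies the terminating NUL */
    return buf;
}
```

Calling build_regex("--123") gives back the short regex shown earlier, ^(([^-]|-[^-]|--[^1]|--1[^2]|--12[^3])+), and the result can be passed straight to regcomp with REG_EXTENDED.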