I am working on a parser for email messages in MIME format. I am forced to use the "posix regex" library written in C, and I am puzzled by its behaviour.
Suppose we have the following part of an email message:
--------------010402010107070509040804
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit
plain message
--------------010402010107070509040804
Content-Type: text/html; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit
html message
--------------010402010107070509040804--
Now I need to extract the different parts of the message (plain and HTML). I used the following pattern to get the data between boundaries:
^((.|\\s)+?)--------------010402010107070509040804
This pattern works well in some regex libraries. For example, when I ran the same regex in JavaScript, I was able to get those two parts of the message without any problem.
However, the "posix regex" library returns the whole message, excluding only the "--" at the end. This is its result:
--------------010402010107070509040804
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit
plain message
--------------010402010107070509040804
Content-Type: text/html; charset=ISO-8859-2
Content-Transfer-Encoding: 8bit
html message
--------------010402010107070509040804
Why did it not stop after finding the first occurrence of the boundary after the plain message? Am I missing something?
POSIX doesn't have greediness modifiers, so +? isn't a lazy quantifier there and the match is always the longest one. There's a way to do it, but it's ugly and long. To simplify, say the token was much shorter, like --123. The regex you'd need for that is already insanely long for something so simple. Basically you're telling the regex that you want a repetition of anything that isn't -, or a - followed by anything that isn't -, or -- followed by anything that isn't 1, and so on and so on.
I made a script to produce a regex from an input token and ran it with --------------010402010107070509040804. The result is a beast, but it's the best POSIX can do as far as I know :P
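For illustration, here is a minimal C sketch of what such a generator could look like. It is not the original script: the names build_until_pattern, put_literal and text_before_token are made up for this example, and it assumes the extended (REG_EXTENDED) syntax. It builds the alternation described above (a non-matching first character, or the first character followed by a non-matching second one, and so on) and then uses regcomp/regexec to measure the text in front of the first token. One caveat of this simple construction: when the token starts with a repeated character, an extra copy of that character directly before the token (for example ---123 with the token --123) can make the match miss it, so treat this as a starting point rather than a complete solution.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append one character as an ERE literal, escaping metacharacters. */
static char *put_literal(char *out, char c)
{
    if (c != '\0' && strchr(".[\\()*+?{|^$", c) != NULL)
        *out++ = '\\';
    *out++ = c;
    return out;
}

/*
 * Hypothetical helper: build "^((alt0|alt1|...)*)token", where alt i is
 * the first i characters of the token followed by "anything except
 * character i". For "--123" this yields
 *   ^(([^-]|-[^-]|--[^1]|--1[^2]|--12[^3])*)--123
 * The caller frees the returned string. Returns NULL on failure.
 */
static char *build_until_pattern(const char *token)
{
    size_t n = strlen(token);
    size_t i, j;
    char *buf, *p;

    if (n == 0)
        return NULL;
    /* Generous upper bound: n alternatives of at most 2n+5 bytes each,
     * plus the escaped token and the fixed punctuation. */
    buf = malloc(n * (2 * n + 5) + 2 * n + 16);
    if (buf == NULL)
        return NULL;

    p = buf;
    p += sprintf(p, "^((");
    for (i = 0; i < n; i++) {
        if (i > 0)
            *p++ = '|';
        for (j = 0; j < i; j++)
            p = put_literal(p, token[j]);
        /* Inside a bracket expression a single character is literal. */
        *p++ = '[';
        *p++ = '^';
        *p++ = token[i];
        *p++ = ']';
    }
    p += sprintf(p, ")*)");
    for (j = 0; j < n; j++)
        p = put_literal(p, token[j]);
    *p = '\0';
    return buf;
}

/*
 * Length of the text that precedes the first token in `text`,
 * or -1 if the pattern cannot be built, compiled, or matched.
 */
static long text_before_token(const char *text, const char *token)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: text before token */
    long len = -1;
    char *pattern = build_until_pattern(token);

    if (pattern == NULL)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) == 0) {
        if (regexec(&re, text, 2, m, 0) == 0)
            len = (long)(m[1].rm_eo - m[1].rm_so);
        regfree(&re);
    }
    free(pattern);
    return len;
}
```

Calling text_before_token(body, "--123") on a body that starts with "plain message" followed by a newline and the token gives the length of that first part. To walk through all parts, the same call can be repeated on the text that follows each match.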