Trailing context match EOF flex-lexer

148 views Asked by At

I want to match these two tokens:
1. NUM: A series of characters in [0-9_] with an optional . in between.
2. ID: A series of characters in [a-zA-Z0-9_] with at-least one [a-zA-Z] character.

Flex rules for these will be:

[0-9_]+([.][0-9_]+)?/[^a-zA-Z0-9_] return NUM;

[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]* return ID;

.
.
.

Note that, trailing context is required for NUM since "123.456ab" should match 123-NUM, .-OPER & 456ab-ID. Without the trailing context, it would match 123.456-NUM & ab-ID.

But now the problem is, it will not match NUM followed by EOF. So, how to match EOF in trailing context of flex rule?

TL;DR:
What I want: NUM not followed by [a-zA-Z0-9_].
What I'm getting currently: NUM followed by a character other than [a-zA-Z0-9_].
These two differ in at EOF.

EDIT: Just got to know that Re/Flex supports word boundaries. If I shift from using Flex to Re/Flex, is there any performance downsides? or any other things that I should be aware of?

1

There are 1 answers

7
rici On BEST ANSWER

The fact that you can't put EOF in trailing context is occasionally annoying but there's almost always a workaround, usually based on using maximal munch match ordering to ensure that some pattern matches at EOF because any other match would be longer. (Remember that trailing context counts for length comparison even though it's not part of the final token).

Here's one example:

[0-9_]+/[.][0-9_]*[a-zA-Z]    return NUM;
[0-9_]+([.][0-9_]+)?          return NUM;
[0-9_]*[a-zA-Z][a-zA-Z0-9_]*  return ID;

Pattern one matches digits followed by a decimal point if the decimal point is in turn followed by something which could be an ID.

Pattern two matches any number, regardless of what (if anything) follows.

Pattern three matches an ID (at least one letter). (It has the same effect as your second pattern. I just shortened the first character class; since * makes the prefix optional anyway, an ID with a leading letter can be matched directly by the rest of the pattern.)

We count on maximal munch to avoid pattern two matching prematurely. Numbers without decimal point followed by a letter will have a longer match at pattern three; numbers with a decimal point followed by a letter will have a longer match with pattern three. All that's left are numbers not followed by a letter; for those pattern two will apply.