Capture names containing, but not ending in, dashes


I am trying to capture names (not starting with a number) which could contain dashes, such as hello-world. My problem is that I also have rules for single dashes and symbols which conflict with it:

[A-Za-z][A-Za-z0-9-]+     { /* capture "hello-world" */ }
"-"                       { return '-'; }
">"                       { return '>'; }

When the lexer reads hello-world->, the rules above yield hello-world- and >, whereas I expected hello-world, - and > as three separate tokens. My first attempt at a fix was this:

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+     { /* ensure final dash is never included at the end */ }

That works, except for single-letter words, so finally I implemented this:

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+     { /* ensure final dash is never included at the end */ }
[A-Za-z][A-Za-z0-9]*                  { /* capture possible single letter words */ }
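
The two failure cases described above can be reproduced outside the lexer. The following is a minimal sketch that uses Python's re module as a stand-in for the lexer's pattern matching (an assumption for illustration; flex itself is not involved), and the helper name longest_prefix is hypothetical. It asks, for each rule, what the longest matching prefix of the input would be, which is how a lex-style scanner picks a token.

```python
import re

# Python's re is only a stand-in for the lexer here (assumption).
ORIGINAL = re.compile(r"[A-Za-z][A-Za-z0-9-]+")
FIRST_FIX = re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+")


def longest_prefix(pattern, text):
    """Return the longest prefix of text matched entirely by pattern, or None."""
    # Try the longest candidate first, mimicking a longest-match scanner.
    for end in range(len(text), 0, -1):
        if pattern.fullmatch(text, 0, end):
            return text[:end]
    return None


if __name__ == "__main__":
    # The original rule swallows the trailing dash.
    print(longest_prefix(ORIGINAL, "hello-world->"))
    # The first fix stops before the dash ...
    print(longest_prefix(FIRST_FIX, "hello-world->"))
    # ... but cannot match a single-letter name at all.
    print(longest_prefix(FIRST_FIX, "a"))
```

The first fix needs at least two characters (one for the leading letter, one for the final class), which is why a second rule was added for the single-letter case.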

Question: Is there a more elegant way to do it?

Answer by sepp2k (best answer)
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z][A-Za-z0-9]*

Note that, as you said, the first rule already covers everything that's not a single letter. So the second rule only has to match single letters and can be shortened to just [A-Za-z]:

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z]

Now the second rule is a prefix of the first, so we can combine the two into a single rule by making the part after the first letter optional:

[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?

The + on the last part is unnecessary, because every trailing alphanumeric character except the final one can just as well be matched by the middle part. So the simplest version is:

[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?
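
As a sanity check on the simplification, here is a rough sketch that compares the combined rule with the original two-rule version and then runs a toy longest-match tokenizer over the example input. It uses Python's re module rather than flex (an assumption made so the check is easy to run), and the names find_disagreements and tokenize are hypothetical helpers, not part of any lexer API. The comparison is brute force over short strings from a small alphabet, so it is evidence rather than proof.

```python
import re
from itertools import product

TWO_RULES = [
    re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+"),
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),
]
COMBINED = re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?")


def find_disagreements(alphabet="aZ9->", max_len=5):
    """List short strings that one formulation accepts and the other rejects."""
    found = []
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            by_two = any(p.fullmatch(s) for p in TWO_RULES)
            by_one = COMBINED.fullmatch(s) is not None
            if by_two != by_one:
                found.append(s)
    return found


# Toy scanner: at each position, take the longest match among all rules,
# which approximates how a lex-style scanner chooses tokens.
RULES = [("NAME", COMBINED), ("-", re.compile(r"-")), (">", re.compile(r">"))]


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for kind, pattern in RULES:
            for end in range(len(text), pos, -1):
                if pattern.fullmatch(text, pos, end):
                    if best is None or end - pos > len(best[1]):
                        best = (kind, text[pos:end])
                    break  # longest match for this rule found
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


if __name__ == "__main__":
    print(find_disagreements())
    print(tokenize("hello-world->"))
```

With the combined rule, the input hello-world-> splits into a name, a dash and a greater-than sign, and a single letter followed by -> is still recognised as a name.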