I am using F#'s FsLex to generate a lexer. I have difficulty understanding the following two lines from a textbook. Why is the newline (`\n`) treated differently from the other whitespace? In particular, what does `lexbuf.EndPos <- lexbuf.EndPos.NextLine` do beyond what `Tokenize lexbuf` already does?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
A `rule` is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., `'\n'`) or class of characters (`[' ' '\t' '\r']`) in your input. The expression on the right side of the rule, inside the curly braces `{ ... }`, defines an action. The definition you pasted in appears to be a tokenizer.

The expression `Tokenize lexbuf` is a recursive call to the `Tokenize` rule. In essence, this rule ignores whitespace characters. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your `Tokenize` rule (e.g., for keywords, identifiers, operators, and literals) to produce a complete lexer definition.

The second rule, the one that matches
`'\n'`, also ignores the whitespace, but as you correctly point out, it does something extra first: it updates the lexer buffer's end position (`lexbuf.EndPos`) to the start of the next line (`lexbuf.EndPos.NextLine`) before recursively calling `Tokenize` again. Why? Because the lexer only counts characters; unless you explicitly record that a newline was seen, its notion of the current line number never advances. Since you're only showing a lexer fragment here, I can only guess what `lexbuf.EndPos` is used for, but it's pretty common to keep that information around for diagnostic purposes, such as reporting the line and column of a syntax error.
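To make the "add other cases" advice concrete, here is a hedged sketch of what a fuller `Tokenize` rule might look like in fslex syntax. The token constructors (`INT`, `LET`, `ID`, `PLUS`, `EOF`) are assumptions: in a real project they would be declared by the accompanying fsyacc parser (or a hand-written token type), and the `lexeme` helper in the header is likewise illustrative.

```
{
// Header block: plain F# code, compiled into the generated lexer.
// LexBuffer and Position come from FSharp.Text.Lexing (FsLexYacc).
open FSharp.Text.Lexing
// Hypothetical helper: the text matched by the current rule.
let lexeme (lexbuf: LexBuffer<char>) = LexBuffer<char>.LexemeString lexbuf
}

rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }                // skip whitespace, emit nothing
| '\n'            { lexbuf.EndPos <- lexbuf.EndPos.NextLine   // bump the line counter
                    Tokenize lexbuf }
| ['0'-'9']+      { INT (int (lexeme lexbuf)) }      // integer literal
| "let"           { LET }                            // keyword (listed before ID so it wins)
| ['a'-'z']+      { ID (lexeme lexbuf) }             // identifier
| '+'             { PLUS }
| eof             { EOF }
```

Note the ordering: fslex prefers the longest match, and on a tie the earlier rule wins, which is why the `"let"` keyword case must appear before the general identifier case. The `'\n'` case is exactly your textbook's line; every other case returns a token, while the whitespace cases loop back into `Tokenize` without producing one.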