Proper way to parse multiple items

I have an input file with multiple lines, each containing fields separated by spaces. My leex and yecc definition files are:

scanner.xrl:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

parser.yrl:

Nonterminals line.

Terminals string.

Rootsymbol line.

Endsymbol new_line.

line -> string : ['$1'].
line -> string line: ['$1'|'$2'].

Erlang code.

When I run it as is, the first line is parsed and then the parser stops:

1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {new_line,1},
     {string,2,"d"},
     {string,2,"e"},
     {new_line,2},
     {string,3,"f"},
     {new_line,3}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}

If I remove the Endsymbol line from parser.yrl and change the scanner.xrl file as follows:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

all my lines are parsed as a single item:

1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}]}

What would be the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look something like:

{ok,[[{string,1,"a"},
      {string,1,"b"},
      {string,1,"c"}],
     [{string,2,"d"},
      {string,2,"e"}],
     [{string,3,"f"}]]}

There is 1 answer

Answer by Yan Valuyskiy (best answer)

Here is one lexer/parser pair that does the job. yecc reports one shift/reduce conflict for it, but I think it will solve your problem; you only need to clean up the tokens as you prefer.

I'm pretty sure there is a much easier and faster way to do it, but during my "lexer fighting times" it was so hard to find any information at all that I hope this gives you an idea of how to proceed with parsing in Erlang.

scanner.xrl

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t)+ : skip_token.
\n : {token, {line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

parser.yrl

Nonterminals 
    Lines
    Line
    Strings.

Terminals string line.

Rootsymbol Lines.

Lines -> Line Lines : lists:flatten(['$1', '$2']).
Lines -> Line : lists:flatten(['$1']).

Line -> Strings line : {line, lists:flatten(['$1'])}.
Line -> Strings : {line, lists:flatten(['$1'])}.

Strings -> string Strings : lists:append(['$1'], '$2').
Strings -> string : lists:flatten(['$1']).

Erlang code.
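
To try the pair above, a session like the following should work in the Erlang shell. It is a sketch that assumes the two files are saved as scanner.xrl and parser.yrl in the current directory; the variable names Tokens and _EndLine are my own.

```erlang
%% Generate scanner.erl and parser.erl from the definition files.
%% yecc is expected to warn about the single shift/reduce conflict here.
leex:file(scanner).
yecc:file(parser).

%% Compile and load the generated modules (c/1 is a shell command).
c(scanner).
c(parser).

%% Tokenize the sample input, then hand the token list to the parser.
{ok, Tokens, _EndLine} = scanner:string("a b c\nd e\nf\n").
parser:parse(Tokens).
```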

output

{ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
     {line,[{string,2,"d"},{string,2,"e"}]},
     {line,[{string,3,"f"}]}]}

The parser flow is the following:

  • The root symbol is the abstract "Lines"
  • "Lines" consists of "Line + Lines" or simply "Line", which gives the looping
  • "Line" consists of "Strings + line", or simply "Strings" when the end of the file is reached
  • "Strings" consists of 'string', or "'string' + Strings" when several strings are provided
  • 'line' is the '\n' symbol
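
To make this flow concrete, here is the same recursion written by hand as plain Erlang functions. This is only an illustration of the rule structure, not what yecc generates; the module name lines_sketch and its function names are hypothetical, and unlike the grammar it also accepts empty lines (yielding {line, []}).

```erlang
-module(lines_sketch).
-export([lines/1]).

%% Lines -> Line Lines | Line
%% Read one line's worth of tokens, then recurse on what is left.
lines([]) ->
    [];
lines(Tokens) ->
    {Strings, Rest} = strings(Tokens, []),
    [{line, Strings} | lines(Rest)].

%% Strings -> string Strings | string
%% Collect string tokens until a 'line' token or the end of the input.
strings([{string, _, _} = Tok | Rest], Acc) ->
    strings(Rest, [Tok | Acc]);
strings([{line, _} | Rest], Acc) ->
    {lists:reverse(Acc), Rest};
strings([], Acc) ->
    {lists:reverse(Acc), []}.
```

Calling lines_sketch:lines/1 on the token list returned by scanner:string/1 should give the same {line, ...} grouping as the parser.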

Please allow me to give a few comments on issues I noticed in the original code.

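  • Treat the whole file as a nested list rather than parsing it line by line; this is why the Lines and Line nonterminals are introduced. Declaring new_line as the Endsymbol tells the parser that the input ends there, which is why parsing stopped after the first line
  • "Terminals" are tokens that are never expanded into ANY other token; "Nonterminals" are evaluated further, because they are composite structures built from other symbols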
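
If the goal is the plain nested-list shape asked for in the question, a variant of the grammar along these lines might work. It is an untested sketch that assumes the scanner.xrl from this answer (so the newline terminal is named line) and that every input line, including the last one, ends with a newline. The semantic actions are my own, not part of the grammar above.

```erlang
%% parser.yrl (hypothetical variant): returns a list of lists of string tokens
Nonterminals Lines Line Strings.

Terminals string line.

Rootsymbol Lines.

%% A file is one or more lines; collect them into the outer list.
Lines -> Line : ['$1'].
Lines -> Line Lines : ['$1' | '$2'].

%% A line is its strings followed by the newline token, which is dropped.
Line -> Strings line : '$1'.

%% Collect the string tokens of one line into an inner list.
Strings -> string : ['$1'].
Strings -> string Strings : ['$1' | '$2'].

Erlang code.
```

Because a Line must end in the line token here, the optional-newline rule is gone, which I would expect to remove the shift/reduce conflict. The cost is that input without a final newline, or with blank lines, is rejected.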