How do you parse sections of text with Lark in Python?

I'm trying to figure out how to use the Lark Python module to parse a document that looks like this:

---> TITLE

Introduction

---> CONTENT

The quick

Brown fox

---> TEST

Jumps over

---> CONTENT 

The lazy dog

Each ---> marks the start of a section of a specific type that has some content that goes until the next ---> section starts.

So far, I have this:


from lark import Lark

parser = Lark(r"""
    start: section*
    | line*

    section.1 : "---> " SECTION_TITLE "\n\n"
    SECTION_TITLE.1 :  "TITLE" | "CONTENT" | "SOURCE" | "OUTPUT"

    line.-1: ANY_LINE
    ANY_LINE.-1: /.+\n*/

    """, start='start')

with open("src/index.mdx") as _in:
    print(parser.parse(_in.read()))

It parses the file, but everything shows up in ANY_LINE tokens instead of splitting out the section headers. I'm new to this type of parser and feel like I'm missing something obvious, but I haven't been able to figure it out.

1 Answer

Alan W. Smith answered:

I think this does what I'm after. I'm not marking it as the answer for now in case other folks have better ideas.

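from lark import Lark

parser = Lark(r"""
    start: section*

    // A section is the arrow, a known title, then its body lines.
    section : THING SECTION_TITLE line*
    THING : "--->"
    SECTION_TITLE :  "TITLE" | "CONTENT" | "SOURCE" | "OUTPUT" | "TEST"

    // Priority -1 is meant to keep body text from winning over the header terminals.
    line: ANY_LINE
    ANY_LINE.-1: /.+\n*/

    %import common.WS
    %ignore WS

    """, start='start')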