I'm trying to figure out how to use the Lark Python module to parse a document that looks like this:
---> TITLE
Introduction
---> CONTENT
The quick
Brown fox
---> TEST
Jumps over
---> CONTENT
The lazy dog
Each ---> line marks the start of a section of a specific type; the section's content runs until the next ---> line.
So far, I have this:
from lark import Lark
parser = Lark(r"""
start: section*
    | line*
section.1 : "---> " SECTION_TITLE "\n\n"
SECTION_TITLE.1 : "TITLE" | "CONTENT" | "SOURCE" | "OUTPUT"
line.-1: ANY_LINE
ANY_LINE.-1: /.+\n*/
""", start='start')
with open("src/index.mdx") as _in:
    print(parser.parse(_in.read()))
It parses the file, but everything shows up in ANY_LINE tokens instead of splitting out the section headers. I'm new to this type of parser and feel like I'm missing something obvious, but I haven't been able to figure it out.
I think this is doing what I'm after. Not marking this as the answer for now in case other folks have better ideas.