SGF is the format most widely used to save Go (board game) games as text. As a whole, SGF is a text-based representation of a tree, which means it is recursive, but this question concerns only the data in each node. One example of how it saves that data:
GM[1]FF[4]CA[UTF-8]AP[Sabaki:0.52.2]KM[6.5]SZ[19]DT[2023-12-25]
Data is saved as pairs of a key string plus data within brackets ([]). The keys are usually of length 2, but not necessarily so; moves are encoded as B[aa] or W[bb], where B denotes Black and W denotes White, while aa and bb are the coordinates on the board.
My question is: what would then be the regex to extract these data groups as key-value objects? (I'm using JavaScript or TypeScript here.)
1. If only trying to match the (meta)data
Assuming only the simplest rules, the regex that you (might be) looking for is:
...where the g flag stands for "global" (matches all instances) and y for "sticky" (each match either starts at the very first character or right after the end of the last match). If you are 120% confident that the content is valid, feel free to drop the y.
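As an illustration of how the two flags work together, here is a minimal sketch. The pattern and the function name parseProperties are assumptions made for this example: it treats a key as one or more uppercase letters followed by one or more bracketed values, and lets a backslash escape the next character inside a value. Values are returned raw, without unescaping.

```typescript
// Hypothetical helper: extracts SGF-style properties from the text of a single node.
function parseProperties(node: string): Record<string, string[]> {
  // Assumed pattern: KEY followed by one or more [value] groups.
  // "g" keeps advancing lastIndex; "y" forces each match to start exactly there.
  const property = /\s*([A-Z]+)((?:\s*\[(?:\\[\s\S]|[^\]\\])*\])+)/gy;
  // Matches one bracketed value, brackets included.
  const value = /\[(?:\\[\s\S]|[^\]\\])*\]/g;

  const result: Record<string, string[]> = {};
  let consumed = 0;
  let m: RegExpExecArray | null;
  while ((m = property.exec(node)) !== null) {
    const raw: string[] = m[2].match(value) ?? [];
    // Strip the surrounding brackets from each value.
    const values = raw.map((v) => v.slice(1, -1));
    result[m[1]] = (result[m[1]] ?? []).concat(values);
    consumed = property.lastIndex;
  }
  // Because of "y", matching stops at the first invalid character,
  // so anything left over (other than whitespace) is an error.
  if (node.slice(consumed).trim() !== "") {
    throw new SyntaxError(`Unexpected content at offset ${consumed}`);
  }
  return result;
}

const metadata = parseProperties(
  "GM[1]FF[4]CA[UTF-8]AP[Sabaki:0.52.2]KM[6.5]SZ[19]DT[2023-12-25]"
);
console.log(metadata.KM); // [ '6.5' ]
```

Without the y flag the loop would silently skip over invalid text between properties, which is why the sticky flag is worth keeping unless the input is known to be valid.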
2. On the Full Grammar
According to the second link, this is the full SGF grammar:
The Tail rule is recursive, so no rule that depends on it (Collection, GameTree, and Tail itself) can be rewritten as an ECMAScript-compliant regex: ECMAScript regexes have no construct for recursion.
PCRE, on the other hand, supports recursive patterns:
Try it on regex101.com.
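In JavaScript or TypeScript, one way around the missing recursion is to keep a regex for the flat property level only and let ordinary functions handle the nesting. The following is a minimal sketch, not the grammar from the link: it assumes an FF[4]-style shape in which a game tree is "(" followed by one or more ";"-prefixed nodes, zero or more nested trees, and ")". The names parseSgf, SgfTree and SgfNode are hypothetical.

```typescript
// Hypothetical types for this sketch.
interface SgfNode { [key: string]: string[] }
interface SgfTree { nodes: SgfNode[]; children: SgfTree[] }

function parseSgf(text: string): SgfTree[] {
  let pos = 0;
  // Sticky regex for one property: KEY followed by one or more [value] groups (assumed shape).
  const property = /\s*([A-Z]+)((?:\s*\[(?:\\[\s\S]|[^\]\\])*\])+)/y;
  const value = /\[(?:\\[\s\S]|[^\]\\])*\]/g;

  const skipSpace = (): void => {
    while (pos < text.length && /\s/.test(text.charAt(pos))) pos++;
  };
  const expect = (ch: string): void => {
    skipSpace();
    if (text.charAt(pos) !== ch) {
      throw new SyntaxError(`Expected "${ch}" at offset ${pos}`);
    }
    pos++;
  };

  function parseNode(): SgfNode {
    expect(";");
    const node: SgfNode = {};
    for (;;) {
      property.lastIndex = pos; // resume exactly where the previous match ended
      const m = property.exec(text);
      if (m === null) break;
      const raw: string[] = m[2].match(value) ?? [];
      node[m[1]] = (node[m[1]] ?? []).concat(raw.map((v) => v.slice(1, -1)));
      pos = property.lastIndex;
    }
    return node;
  }

  function parseTree(): SgfTree {
    expect("(");
    const tree: SgfTree = { nodes: [], children: [] };
    do {
      tree.nodes.push(parseNode());
      skipSpace();
    } while (text.charAt(pos) === ";");
    // The recursion that a regex cannot express lives here.
    while (text.charAt(pos) === "(") {
      tree.children.push(parseTree());
      skipSpace();
    }
    expect(")");
    return tree;
  }

  const trees: SgfTree[] = [];
  skipSpace();
  while (pos < text.length) {
    trees.push(parseTree());
    skipSpace();
  }
  return trees;
}

const [game] = parseSgf("(;GM[1]SZ[19];B[aa](;W[bb])(;W[cc];B[dd]))");
console.log(game.nodes.length, game.children.length); // 2 2
```

Each variation ends up as a child tree, so arbitrarily deep nesting is handled by the call stack rather than by the pattern.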
Assuming Tail cannot contain itself (i.e. Tail = "(" NodeSequence ")"), the GameTree rule can be rewritten as (~1100 characters):
Try it on regex101.com.
This only ever validates the content, giving you no groups. However, even if groups were added (e.g. \(\s*(?<rootNode>...)(?<nodeSequence>...)(?<tails>...)\s*\)), they would have to be re-parsed using the other rules.
That said, even assuming a non-recursive structure, a parser using pure regex would repeat itself a lot.
Code used to generate the monster above: