I am trying to extract the Chomsky Normal Form (CNF) grammar productions of a sentence from its parse tree:
(ROOT
  (S
    (NP (DT the) (NNS kids))
    (VP (VBD opened)
      (NP (DT the) (NN box))
      (PP (IN on)
        (NP (DT the) (NN floor))))))
I put the whole tree into a string named S and then:
from nltk import Tree

tree = Tree.fromstring(S)
tree.chomsky_normal_form()  # converts the tree in place
for p in tree.productions():
    print(p)
The output is
(1) NN -> 'box'
(2) PP -> IN NP
(3) DT -> 'the'
(4) ROOT -> S
(5) NP -> DT NN
(6) VBD -> 'opened'
(7) VP|<NP-PP> -> NP PP
(8) VP -> VBD VP|<NP-PP>
(9) NP -> DT NNS
(10) NN -> 'floor'
(11) IN -> 'on'
(12) NNS -> 'kids'
(13) S -> NP VP
But some of the productions (numbers 7 and 8) don't seem to be in CNF! What is the problem?
VP|<NP-PP> is one nonterminal symbol. The vertical bar does not mean multiple options in the traditional sense. Rather, NLTK puts it there to indicate where the rule is derived from, i.e. "this new nonterminal symbol was derived from the combination of VP and NP-PP." It is the LHS of a new production rule that NLTK created to convert your grammar into Chomsky Normal Form.
Compare this with the productions of the tree before the CNF conversion.
Specifically, look at the rule VP -> VBD NP PP, which is NOT in CNF: the RHS of a CNF production must be either exactly two nonterminal symbols or a single terminal symbol, and this rule has three nonterminals.
The two rules (7): VP|<NP-PP> -> NP PP and (8): VP -> VBD VP|<NP-PP> in your question are together functionally equivalent to the original rule VP -> VBD NP PP.
When VP is expanded, applying rule (8) results in: VBD VP|<NP-PP>
VP|<NP-PP> is in turn the LHS of the newly created rule (7), and expanding it results in: VBD NP PP
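To make the equivalence concrete, here is a minimal pure-Python sketch of right-factoring a single rule. It only illustrates the idea and is not NLTK's actual implementation; the function name binarize_right is hypothetical, and the helper-symbol naming simply mimics the VP|<NP-PP> style seen in the output above.

```python
def binarize_right(lhs, rhs):
    """Split one rule into binary rules by repeatedly peeling off the first RHS symbol.

    Hypothetical helper for illustration only. Each helper symbol is named
    after the original LHS and the symbols it still has to produce.
    """
    rules = []
    current = lhs
    rest = list(rhs)
    while len(rest) > 2:
        head, rest = rest[0], rest[1:]
        # One new nonterminal whose name records where it came from.
        helper = f"{lhs}|<{'-'.join(rest)}>"
        rules.append((current, (head, helper)))
        current = helper
    rules.append((current, tuple(rest)))
    return rules


for left, right in binarize_right("VP", ["VBD", "NP", "PP"]):
    print(left, "->", " ".join(right))
# VP -> VBD VP|<NP-PP>
# VP|<NP-PP> -> NP PP
```

Each helper name is an ordinary string used as one symbol, which is why the bar and angle brackets carry no grammatical meaning of their own.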
Specifically, if you isolate the rule itself, you can inspect its LHS and confirm that it is indeed a single symbol.
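A minimal sketch of that check, assuming NLTK is installed and reusing the tree from the question. The variable names are only illustrative; the rule is picked out by looking for the vertical bar in the LHS, which is an assumption that works for this tree but is not a general NLTK convention you should rely on.

```python
from nltk import Tree
from nltk.grammar import Nonterminal

S = """(ROOT
  (S
    (NP (DT the) (NNS kids))
    (VP (VBD opened)
      (NP (DT the) (NN box))
      (PP (IN on)
        (NP (DT the) (NN floor))))))"""

tree = Tree.fromstring(S)
tree.chomsky_normal_form()  # modifies the tree in place

# Isolate the production whose LHS is the symbol NLTK introduced.
rule = next(p for p in tree.productions() if "|" in p.lhs().symbol())
lhs = rule.lhs()

print(type(lhs).__name__)  # the LHS is one grammar symbol object
print(lhs.symbol())        # its name is a single string
print(len(rule.rhs()))     # number of symbols on the RHS
```

If the conversion behaves as in the question, the LHS is a single Nonterminal named VP|<NP-PP> and the RHS has exactly two symbols, so the rule is in CNF.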