I am trying to implement the CYK algorithm in Ruby according to the pseudocode from Wikipedia. My implementation fails to produce the correct parse table. In the method given below, grammar is an instance of my own grammar class. Here is the code:
# checks whether a grammar accepts given string
# assumes input grammar to be in CNF
def self.parse(grammar, string)
  n = string.length
  r = grammar.nonterminals.size
  # create n x n x r matrix
  tbl = Array.new(n) { |_| Array.new(n) { |_| Array.new(r, false) } }
  (0...n).each { |s|
    grammar.rules.each { |rule|
      # check if rule is unit production: A -> b
      next unless rule.rhs.size == 1
      unit_terminal = rule.rhs[0]
      if unit_terminal.value == string[s]
        v = grammar.nonterminals.index(rule.lhs)
        tbl[0][s][v] = true
      end
    }
  }
  (1...n).each { |l|
    (0...n - l + 1).each { |s|
      (0..l - 1).each { |p|
        # enumerate over A -> B C rules, where A, B and C are
        # indices in array of NTs
        grammar.rules.each { |rule|
          next unless rule.rhs.size == 2
          a = grammar.nonterminals.index(rule.lhs)
          b = grammar.nonterminals.index(rule.rhs[0])
          c = grammar.nonterminals.index(rule.rhs[1])
          if tbl[p][s][b] and tbl[l - p][s + p][c]
            tbl[l][s][a] = true
          end
        }
      }
    }
  }
  v = grammar.nonterminals.index(grammar.start_sym)
  return tbl[n - 1][0][v]
end
I tested it with this simple example:
grammar:
A -> B C
B -> 'x'
C -> 'y'
string: 'xy'
The parse table tbl was the following:
[[[false, true, false], [false, false, true]],
[[false, false, false], [false, false, false]]]
The problem definitely lies in the second part of the algorithm: substrings of length larger than 1. The first layer (tbl[0]) contains correct values.
Help much appreciated.
The problem lies in the translation from the 1-based arrays in the pseudocode to the 0-based arrays in your code. It becomes obvious when you look at the first indices in the condition
tbl[p][s][b] and tbl[l-p][s+p][c]
in the very first run of the loop. The pseudocode (l = 2, p = 1) checks tbl[1] and tbl[1], whereas your code (l = 1, p = 0) checks tbl[0] and tbl[1]. I think you have to make the 0-based correction when you access the array and not in the ranges for l and p. Otherwise the calculations with the indices are wrong. This should work:
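Here is a minimal sketch of that correction. It keeps l, s and p 1-based, exactly as in the pseudocode, and subtracts 1 only at the moment the table is indexed. Your grammar class is not shown, so Terminal, Nonterminal, Rule and Grammar below are hypothetical stand-ins that only mimic the fields your code uses (nonterminals, rules, start_sym, lhs, rhs, value). The method name cyk_parse and the guard for the empty string are also my own additions.

```ruby
# Hypothetical stand-ins for the grammar class from the question
# (the real class is not shown, so these names and fields are assumptions).
Terminal    = Struct.new(:value)
Nonterminal = Struct.new(:name)
Rule        = Struct.new(:lhs, :rhs)
Grammar     = Struct.new(:nonterminals, :rules, :start_sym)

# Checks whether a grammar accepts the given string.
# Assumes the input grammar is in CNF.
def cyk_parse(grammar, string)
  n = string.length
  # Assumption: the empty string is rejected (no S -> epsilon handling here).
  return false if n.zero?

  r = grammar.nonterminals.size
  # n x n x r table; tbl[l - 1][s - 1][v] is true when nonterminal v
  # derives the substring of length l starting at 1-based position s.
  tbl = Array.new(n) { Array.new(n) { Array.new(r, false) } }

  # Substrings of length 1: terminal rules A -> b.
  (1..n).each do |s|
    grammar.rules.each do |rule|
      next unless rule.rhs.size == 1
      next unless rule.rhs[0].value == string[s - 1]
      v = grammar.nonterminals.index(rule.lhs)
      tbl[0][s - 1][v] = true
    end
  end

  # Longer substrings: l = length, s = start, p = length of the left part.
  (2..n).each do |l|
    (1..n - l + 1).each do |s|
      (1..l - 1).each do |p|
        grammar.rules.each do |rule|
          next unless rule.rhs.size == 2
          a = grammar.nonterminals.index(rule.lhs)
          b = grammar.nonterminals.index(rule.rhs[0])
          c = grammar.nonterminals.index(rule.rhs[1])
          # Same indices as the pseudocode, each shifted by -1 on access.
          if tbl[p - 1][s - 1][b] && tbl[l - p - 1][s + p - 1][c]
            tbl[l - 1][s - 1][a] = true
          end
        end
      end
    end
  end

  tbl[n - 1][0][grammar.nonterminals.index(grammar.start_sym)]
end

# The example from the question: A -> B C, B -> 'x', C -> 'y'.
a, b, c = %w[A B C].map { |name| Nonterminal.new(name) }
grammar = Grammar.new(
  [a, b, c],
  [Rule.new(a, [b, c]),
   Rule.new(b, [Terminal.new('x')]),
   Rule.new(c, [Terminal.new('y')])],
  a
)
puts cyk_parse(grammar, 'xy')
puts cyk_parse(grammar, 'yx')
```

With your example, the only iteration of the second part is l = 2, s = 1, p = 1, which reads tbl[0][0][b] and tbl[0][1][c] and writes tbl[1][0][a], so 'xy' should be accepted and 'yx' rejected. If you prefer fully 0-based loop variables instead, the equivalent condition would be tbl[p][s][b] and tbl[l - p - 1][s + p + 1][c], with s running over (0...n - l).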