I have recently been working on using NLTK to extract relations from text. I built a sample text, "Tom is the cofounder of Microsoft.", and tested it with the following program, but it returns nothing and I cannot figure out why.
I'm using NLTK version 3.2.1 and Python version 3.5.2.
Here is my code:
import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize

def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()  # "Tom is the cofounder of Microsoft"
    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]
    OF = re.compile(r'.*\bof\b.*')
    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()
1. After some debugging, I found that when I changed the input to
"Gates was born in Seattle, Washington on October 28, 1955. "
the nltk.chunk.ne_chunk() output is:
(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)
The test() returns:
[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']
2. After I changed the input to:
"Gates was born in Seattle on October 28, 1955. "
test() returns nothing.
3. I dug into nltk/sem/relextract.py and found that this strange output is caused by the function semi_rel2reldict(pairs, window=5, trace=False), which produces results only when len(pairs) > 2. That is why a sentence with fewer than three NEs yields nothing.
Is this a bug, or am I using NLTK the wrong way?
First, to chunk NEs with ne_chunk, the usual idiom is to tokenize, POS-tag, and then chunk each sentence, as your code does (see also https://stackoverflow.com/a/31838373/610569).
Next, let's look at the extract_rels function. When you invoke it, it performs 4 steps sequentially.
1. It checks whether your subjclass and objclass are valid, i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202.
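The check in step 1 can be sketched in plain Python as follows. This is a simplified paraphrase rather than NLTK's own code: resolve_class is a hypothetical helper, and the two tables are only an illustrative subset of what relextract.py defines.

```python
# Illustrative subset only (assumption); the full tables live in
# nltk/sem/relextract.py and contain more labels.
NE_CLASSES = {
    'ace': ['LOCATION', 'ORGANIZATION', 'PERSON', 'GPE'],
    'conll2002': ['LOC', 'PER', 'ORG'],
}
SHORT_TO_LONG = {'LOC': 'LOCATION', 'ORG': 'ORGANIZATION', 'PER': 'PERSON'}


def resolve_class(name, corpus):
    """Return the label to compare against, or raise if it is unknown."""
    if name in NE_CLASSES[corpus]:
        return name
    expanded = SHORT_TO_LONG.get(name, name)  # e.g. 'PER' -> 'PERSON'
    if expanded in NE_CLASSES[corpus]:
        return expanded
    raise ValueError('unrecognized entity class: %s' % name)


subject = resolve_class('PER', 'ace')  # short form gets expanded
obj = resolve_class('GPE', 'ace')      # already a valid 'ace' label
```

Under these assumptions both 'PER' and 'GPE' pass this step, so the empty result has to come from a later step.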
2. It extracts "pairs" from your NE-tagged input.
Now let's see what tree2semi_rel() returns for your input sentence, Tom is the cofounder of Microsoft. It returns a list of 2 lists. The first inner list consists of an empty list and the Tree that contains the "PERSON" tag. The second list consists of the phrase "is the cofounder of" and the Tree that contains "ORGANIZATION". Let's move on.
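To make the pairing concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: Chunk is a hypothetical stand-in for an NE subtree, and to_pairs is a hypothetical helper that only mirrors the grouping described above (tokens seen before an entity are collected and attached to that entity).

```python
from typing import List, NamedTuple, Tuple

Token = Tuple[str, str]  # (word, POS tag)


class Chunk(NamedTuple):
    """Hypothetical stand-in for an NE subtree such as Tree('PERSON', [...])."""
    label: str
    leaves: List[Token]


def to_pairs(sentence):
    """Group a chunked sentence into [preceding_tokens, chunk] pairs."""
    pairs = []
    pending = []  # plain tokens seen since the previous chunk
    for item in sentence:
        if isinstance(item, Chunk):
            pairs.append([pending, item])
            pending = []
        else:
            pending.append(item)
    return pairs  # tokens after the last chunk are not part of any pair


sentence = [
    Chunk('PERSON', [('Tom', 'NNP')]),
    ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'),
    Chunk('ORGANIZATION', [('Microsoft', 'NNP')]),
]
pairs = to_pairs(sentence)
```

For this sentence the sketch yields exactly two pairs, one per named entity, which matters for the next step.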
3. extract_rels then tries to change the pairs into some sort of relation dictionary. If we look at what the semi_rel2reldict function returns for your example sentence, we see that this is where the empty list gets returned. So let's look into the code of semi_rel2reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144. The first thing that semi_rel2reldict() does is check whether there are more than 2 elements in the output from tree2semi_rel(), which is not the case for your example sentence. Aha, that's why extract_rels is returning nothing.
Now comes the question of how to make extract_rels() return something even with 2 elements from tree2semi_rel(). Is that even possible? Let's try a different sentence.
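As a rough illustration of that length check, the plain-Python sketch below mimics only the loop condition and the way one pair is consumed per pass. count_reldicts is a hypothetical helper, not NLTK code. The entity count of 3 for the first Gates sentence comes from the chunker output in the question; the count of 2 for the second Gates sentence is an assumption, since its chunker output is not shown.

```python
def count_reldicts(num_pairs):
    """Count how many relation dicts a `while len(pairs) > 2` loop would build."""
    pairs = list(range(num_pairs))  # placeholders; only the length matters here
    produced = 0
    while len(pairs) > 2:
        produced += 1
        pairs = pairs[1:]  # drop the leading pair and look again
    return produced


# "Gates was born in Seattle, Washington on ..." has 3 NE chunks.
three_entities = count_reldicts(3)
# "Gates was born in Seattle on ..." presumably has 2 NE chunks.
two_entities = count_reldicts(2)
```

With three entities the loop builds one relation dict, and with two it builds none, which matches what the question observed for both Gates sentences.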
But that only confirms that extract_rels can't extract anything when tree2semi_rel() returns fewer than 3 pairs. What happens if we remove that condition of while len(pairs) > 2? Why can't we do while len(pairs) > 1? If we look closer into the code, we see the last line of populating the reldict (https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169).
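The effect of that line can be reproduced without NLTK. In this sketch the pairs are plain placeholder lists (an assumption: each entity is reduced to its label string), and the indexing mirrors the rcon lookup, which reads the tokens that belong to the pair after the object:

```python
# Two placeholder pairs: [tokens_before_entity, entity_label].
pairs = [
    [[], 'PERSON'],
    [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')],
     'ORGANIZATION'],
]
window = 5

try:
    right_context = pairs[2][0][:window]  # needs a third pair to exist
    failed = False
except IndexError:
    right_context = []
    failed = True
```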
It tries to access a 3rd element of the pairs, and if the length of the pairs is 2, you'll get an IndexError. So what happens if we remove that rcon key and simply change the condition to while len(pairs) >= 2? To do that we have to override the semi_rel2reldict() function. With that override it works, but there's still a 4th step in extract_rels().
4. It filters the reldicts using the regex you have provided to the pattern parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222. With the hacked version of semi_rel2reldict it works, and the result can then be printed in tuple form with rtuple().
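The whole workaround can be sketched without NLTK, under stated assumptions. Chunk is a hypothetical stand-in for an NE subtree, relaxed_reldicts is a simplified rewrite of the loop with the relaxed condition and no rcon key, and keep paraphrases the step-4 filter. None of this is the library's own code, and the real override would build more keys than shown here.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for an NE subtree such as Tree('PERSON', [...]).
Chunk = namedtuple('Chunk', ['label', 'leaves'])


def join(tokens):
    """Render (word, tag) tokens as 'word/tag word/tag ...'."""
    return ' '.join('/'.join(token) for token in tokens)


def relaxed_reldicts(pairs, window=5):
    """Build relation dicts from 2 or more pairs, leaving out 'rcon'."""
    result = []
    while len(pairs) >= 2:
        (left, subj), (filler, obj) = pairs[0], pairs[1]
        result.append({
            'lcon': join(left[-window:]),
            'subjclass': subj.label, 'subjtext': join(subj.leaves),
            'filler': join(filler),
            'objclass': obj.label, 'objtext': join(obj.leaves),
        })
        pairs = pairs[1:]
    return result


def keep(rel, subjclass, objclass, pattern, window):
    """Step 4: filter on entity classes, filler length and the regex."""
    return (rel['subjclass'] == subjclass
            and rel['objclass'] == objclass
            and len(rel['filler'].split()) <= window
            and pattern.match(rel['filler']) is not None)


pairs = [
    [[], Chunk('PERSON', [('Tom', 'NNP')])],
    [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')],
     Chunk('ORGANIZATION', [('Microsoft', 'NNP')])],
]
OF = re.compile(r'.*\bof\b.*')
rels = [r for r in relaxed_reldicts(pairs)
        if keep(r, 'PERSON', 'ORGANIZATION', OF, window=10)]
for r in rels:
    # Same layout as rtuple(), but with the long class names.
    print('[%s: %r] %r [%s: %r]' % (r['subjclass'], r['subjtext'],
                                    r['filler'], r['objclass'], r['objtext']))
```

Note that the sketch filters on ORGANIZATION, because that is how the chunker tagged Microsoft. With 'GPE' as the object class, as in the question's call, this filter would drop the relation even after the loop condition is relaxed.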