I have recently been working on using NLTK to extract relations from text. I built a sample text, "Tom is the cofounder of Microsoft.", and ran the following program on it, but it returns nothing and I cannot figure out why.
I'm using NLTK version 3.2.1 and Python version 3.5.2.
Here is my code:
import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize

def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()  # "Tom is the cofounder of Microsoft"
    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]
    OF = re.compile(r'.*\bof\b.*')
    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()
1. After some debugging, I found that when I changed the input to
"Gates was born in Seattle, Washington on October 28, 1955. "
the nltk.chunk.ne_chunk() output is:
(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)
The test() returns:
[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']
2. After I changed the input to:
"Gates was born in Seattle on October 28, 1955. "
test() returns nothing.
3. I dug into nltk/sem/relextract.py and found that this strange output is caused by the function semi_rel2reldict(pairs, window=5, trace=False), which returns a result only when len(pairs) > 2. That is why a sentence with fewer than three NEs produces no result.
Is this a bug, or am I using NLTK the wrong way?
Firstly, to chunk NEs with ne_chunk, the idiom is to tokenize, POS-tag, and then chunk each sentence, i.e. ne_chunk(pos_tag(word_tokenize(sent))) (see also https://stackoverflow.com/a/31838373/610569).
Next, let's look at the extract_rels function. When you invoke it, it performs 4 processes sequentially.
1. It checks whether your subjclass and objclass are valid, i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202
2. It extracts "pairs" from your NE-tagged input.
Now let's see what tree2semi_rel() returns for your input sentence, Tom is the cofounder of Microsoft. It returns a list of 2 lists. The first inner list consists of an empty list and the Tree that contains the "PERSON" tag. The second consists of the phrase is the cofounder of and the Tree that contains "ORGANIZATION".
Let's move on.
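The pair structure described above can be sketched without NLTK. This is only an illustration under stated assumptions: Chunk and to_pairs are hypothetical names, Chunk is a stand-in for an NE subtree, and the POS tags are written by hand rather than produced by a tagger. It is not the library's own tree2semi_rel code.

```python
from collections import namedtuple

# Hypothetical stand-in for an NE subtree: a label plus its tagged leaves.
Chunk = namedtuple("Chunk", ["label", "leaves"])

def to_pairs(chunked_sentence):
    """Pair each NE chunk with the plain tagged tokens that precede it."""
    pairs = []
    context = []
    for item in chunked_sentence:
        if isinstance(item, Chunk):
            pairs.append([context, item])
            context = []
        else:
            context.append(item)
    # In this sketch, tokens after the last chunk are not attached to any pair.
    return pairs

# Hand-written chunked sentence (tags are illustrative, not tagger output).
sentence = [
    Chunk("PERSON", [("Tom", "NNP")]),
    ("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN"),
    Chunk("ORGANIZATION", [("Microsoft", "NNP")]),
]

pairs = to_pairs(sentence)
for context, chunk in pairs:
    print(context, chunk.label)
```

The result has two pairs: an empty context with the PERSON chunk, then the four tokens of "is the cofounder of" with the ORGANIZATION chunk.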
3. extract_rels then tries to convert the pairs into some sort of relation dictionary. If we look at what the semi_rel2reldict function returns for your example sentence, we see that this is where the empty list gets returned.
So let's look into the code of semi_rel2reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144. The first thing that semi_rel2reldict() does is check whether the output from tree2semi_rel() has more than 2 elements, which is not the case for your example sentence. Aha, that's why extract_rels is returning nothing.
Now comes the question of how to make extract_rels() return something even with 2 elements from tree2semi_rel(). Is that even possible? Let's try a different sentence.
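To see the effect of the length check in isolation, here is a simplified stand-in for the loop, run on a two-pair input and on a hypothetical three-pair input. The name build_reldicts, the plain-string labels, and the extra GPE pair are assumptions made for illustration; the real function works on Tree objects and fills in more keys.

```python
def build_reldicts(pairs):
    """Simplified stand-in: one dict per pair of adjacent chunks."""
    result = []
    while len(pairs) > 2:  # the condition discussed above
        result.append({
            "subjclass": pairs[0][1],
            "filler": " ".join(pairs[1][0]),
            "objclass": pairs[1][1],
            "rcon": " ".join(pairs[2][0]),  # right context needs a third pair
        })
        pairs = pairs[1:]
    return result

# Two pairs, as for "Tom is the cofounder of Microsoft".
two_pairs = [[[], "PERSON"], [["is", "the", "cofounder", "of"], "ORGANIZATION"]]
# Hypothetical longer input with a third chunk.
three_pairs = two_pairs + [[["in"], "GPE"]]

print(build_reldicts(two_pairs))         # prints []
print(len(build_reldicts(three_pairs)))  # prints 1
```

With n pairs this loop yields n - 2 dictionaries, so the last two chunks of a sentence never form a relation on their own.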
But that only confirms that extract_rels can't extract anything when tree2semi_rel returns fewer than 3 pairs. What happens if we remove the condition while len(pairs) > 2? Why can't we do while len(pairs) > 1?
If we look closer into the code, we see that the last line populating the reldict, the one that sets the rcon key, reads from pairs[2] (https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169).
It tries to access a 3rd element of pairs, and if the length of pairs is 2, you'll get an IndexError.
So what happens if we remove that rcon key and simply change the condition to while len(pairs) >= 2? To do that we have to override the semi_rel2reldict() function. Ah! It works, but there's still a 4th step in extract_rels().
4. It filters the reldicts using the regex you provided to the pattern parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222
Now let's try it with the hacked version of semi_rel2reldict. It works! Now let's see it in tuple form:
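A minimal sketch of the hack and of the filtering step, using plain tuples in place of NLTK's Tree objects so that it runs without any models. The names build_reldicts_relaxed and filter_rels are hypothetical, the words are left untagged for brevity, and the printed format only loosely imitates rtuple. Treat it as an illustration of the idea, not as the library's code.

```python
import re

def build_reldicts_relaxed(pairs, window=5):
    """Like the loop discussed above, but accepts 2 pairs and drops 'rcon'."""
    result = []
    while len(pairs) >= 2:
        result.append({
            "lcon": " ".join(pairs[0][0][-window:]),
            "subjclass": pairs[0][1][0],
            "subjtext": pairs[0][1][1],
            "filler": " ".join(pairs[1][0]),
            "objclass": pairs[1][1][0],
            "objtext": pairs[1][1][1],
            # no "rcon" key: it would need pairs[2], which may not exist
        })
        pairs = pairs[1:]
    return result

def filter_rels(reldicts, subjclass, objclass, pattern, window=10):
    """Keep reldicts whose classes match and whose filler matches the regex."""
    return [
        r for r in reldicts
        if r["subjclass"] == subjclass
        and len(r["filler"].split()) <= window
        and pattern.match(r["filler"])
        and r["objclass"] == objclass
    ]

OF = re.compile(r'.*\bof\b.*')
# Each pair: [preceding words, (NE label, NE text)]; hand-written input.
pairs = [
    [[], ("PERSON", "Tom")],
    [["is", "the", "cofounder", "of"], ("ORGANIZATION", "Microsoft")],
]

rels = filter_rels(build_reldicts_relaxed(pairs), "PERSON", "ORGANIZATION", OF)
for r in rels:
    print("[{subjclass}: {subjtext!r}] {filler!r} [{objclass}: {objtext!r}]".format(**r))
```

Note that the object class has to agree with the label the chunker assigns. In this sketch Microsoft carries ORGANIZATION, so filtering for GPE as in the question's code would still return an empty list even with the relaxed loop.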