SpaCy Coreferee: How to cleanly extract coreferenced text


I am using the Coreferee plugin for spaCy. Running it is quite simple:

import coreferee, spacy
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)

The problem I am having is how to map the coreference chains back onto the text and return the text with the coreferences resolved. I assume I need to iterate over all tokens in the doc and check whether each one can be resolved through the coreference chains. I have little experience with spaCy, so I don't know the best way to achieve this.

Answer by Tomaž Bratanič

The solution is the following:

resolved_text = ""
for token in doc:
    # resolve() returns the list of tokens this token refers to, or None
    repres = doc._.coref_chains.resolve(token)
    if repres:
        # a token can refer to several entities, e.g. "they" -> Peter and wife
        resolved_text += " " + " and ".join([t.text for t in repres])
    else:
        resolved_text += " " + token.text

print(resolved_text)

which returns

Although Peter was very busy with Peter work , Peter had had enough of work . Peter and Peter wife decided Peter and wife needed a holiday . Peter and wife travelled to Spain because Peter and wife loved the Spain very much .