Anaphora resolution in stanford-nlp using python

I am trying to do anaphora resolution; my code so far is below.

First I navigate to the folder where I downloaded the Stanford CoreNLP module. Then I run the following command in the command prompt to start the Stanford CoreNLP server:

java -mx4g -cp "*;stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

After that I run the code below in Python:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

I want to change the sentence "Tom is a smart boy. He know a lot of thing." into "Tom is a smart boy. Tom know a lot of thing.", but I cannot find any tutorial or other help for doing this in Python.

All I am able to do is annotate the text with the code below in Python:

# coreference resolution
sentence = 'Tom is a smart boy. He know a lot of thing.'

output = nlp.annotate(sentence, properties={'annotators':'dcoref','outputFormat':'json','ner.useSUTime':'false'})

and by extracting the coreference chains from the output:

coreferences = output['corefs']

I get the JSON below:

coreferences

{u'1': [{u'animacy': u'ANIMATE',
   u'endIndex': 2,
   u'gender': u'MALE',
   u'headIndex': 1,
   u'id': 1,
   u'isRepresentativeMention': True,
   u'number': u'SINGULAR',
   u'position': [1, 1],
   u'sentNum': 1,
   u'startIndex': 1,
   u'text': u'Tom',
   u'type': u'PROPER'},
  {u'animacy': u'ANIMATE',
   u'endIndex': 6,
   u'gender': u'MALE',
   u'headIndex': 5,
   u'id': 2,
   u'isRepresentativeMention': False,
   u'number': u'SINGULAR',
   u'position': [1, 2],
   u'sentNum': 1,
   u'startIndex': 3,
   u'text': u'a smart boy',
   u'type': u'NOMINAL'},
  {u'animacy': u'ANIMATE',
   u'endIndex': 2,
   u'gender': u'MALE',
   u'headIndex': 1,
   u'id': 3,
   u'isRepresentativeMention': False,
   u'number': u'SINGULAR',
   u'position': [2, 1],
   u'sentNum': 2,
   u'startIndex': 1,
   u'text': u'He',
   u'type': u'PRONOMINAL'}],
 u'4': [{u'animacy': u'INANIMATE',
   u'endIndex': 7,
   u'gender': u'NEUTRAL',
   u'headIndex': 4,
   u'id': 4,
   u'isRepresentativeMention': True,
   u'number': u'SINGULAR',
   u'position': [2, 2],
   u'sentNum': 2,
   u'startIndex': 3,
   u'text': u'a lot of thing',
   u'type': u'NOMINAL'}]}

Any help on this?

There are 3 answers

1
ongenz On BEST ANSWER

Here is one possible solution that uses the data structure output by CoreNLP, which already contains all the information needed. It is not intended as a full solution, and extensions are probably required to deal with all situations, but it is a good starting point.

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')


def resolve(corenlp_output):
    """ Transfer the word form of the antecedent to its associated pronominal anaphor(s) """
    for coref in corenlp_output['corefs']:
        mentions = corenlp_output['corefs'][coref]
        antecedent = mentions[0]  # the antecedent is the first mention in the coreference chain
        for j in range(1, len(mentions)):
            mention = mentions[j]
            if mention['type'] == 'PRONOMINAL':
                # get the attributes of the target mention in the corresponding sentence
                target_sentence = mention['sentNum']
                target_token = mention['startIndex'] - 1
                # transfer the antecedent's word form to the appropriate token in the sentence
                corenlp_output['sentences'][target_sentence - 1]['tokens'][target_token]['word'] = antecedent['text']


def print_resolved(corenlp_output):
    """ Print the "resolved" output """
    possessives = ['hers', 'his', 'their', 'theirs']
    for sentence in corenlp_output['sentences']:
        for token in sentence['tokens']:
            output_word = token['word']
            # check lemmas as well as tags for possessive pronouns in case of tagging errors
            if token['lemma'] in possessives or token['pos'] == 'PRP$':
                output_word += "'s"  # add the possessive morpheme
            output_word += token['after']
            print(output_word, end='')


text = "Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but " \
       "hers is blue. It is older than hers. The big cat ate its dinner."

output = nlp.annotate(text, properties= {'annotators':'dcoref','outputFormat':'json','ner.useSUTime':'false'})

resolve(output)

print('Original:', text)
print('Resolved: ', end='')
print_resolved(output)

This gives the following output:

Original: Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but hers is blue. It is older than hers. The big cat ate its dinner.
Resolved: Tom and Jane are good friends. Tom and Jane are cool. Tom knows a lot of things and so does Jane. Tom's car is red, but Jane's is blue. His car is older than Jane's. The big cat ate The big cat's dinner.

As you can see, this solution doesn't correct the capitalization when a pronoun has a sentence-initial (title-case) antecedent ("The big cat" instead of "the big cat" in the last sentence). Whether to change it depends on the category of the antecedent: common noun antecedents need lowercasing, while proper noun antecedents do not. Some other ad hoc processing might be necessary (as for the possessives in my test sentence). It also presupposes that you will not want to reuse the original output tokens, as they are modified by this code. A way around this would be to make a copy of the original data structure, or to create a new attribute and change the print_resolved function accordingly. Correcting any resolution errors is another challenge!
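The copy-based workaround and the lowercasing of common-noun antecedents mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: resolve_copy is a hypothetical helper name, it picks the antecedent via the isRepresentativeMention flag seen in the JSON from the question (falling back to the first mention), and it treats any NOMINAL antecedent substituted after the first token of a sentence as needing a lowercase initial letter.

```python
import copy


def resolve_copy(corenlp_output):
    """Return a resolved deep copy of the CoreNLP JSON output; the input is left untouched."""
    resolved = copy.deepcopy(corenlp_output)
    for mentions in resolved['corefs'].values():
        # Assumption: prefer the mention flagged as representative, else the first mention.
        antecedent = next(
            (m for m in mentions if m.get('isRepresentativeMention')), mentions[0]
        )
        for mention in mentions:
            if mention is antecedent or mention['type'] != 'PRONOMINAL':
                continue
            text = antecedent['text']
            # Assumption: a common-noun antecedent placed mid-sentence gets a lowercase initial.
            if antecedent['type'] == 'NOMINAL' and mention['startIndex'] > 1:
                text = text[0].lower() + text[1:]
            tokens = resolved['sentences'][mention['sentNum'] - 1]['tokens']
            tokens[mention['startIndex'] - 1]['word'] = text
    return resolved
```

Calling print_resolved(resolve_copy(output)) would then leave output itself unchanged. The sketch does not handle the opposite case (a lowercase antecedent substituted at the start of a sentence), and proper-noun antecedents are left exactly as CoreNLP reports them.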

1
Archana On

I had a similar problem. After trying CoreNLP, I solved it using NeuralCoref. You can do this easily with NeuralCoref using the following code:

import spacy

nlp = spacy.load('en_coref_md')

doc = nlp(u'Phone area code will be valid only when all the below conditions are met. It cannot be left blank. It should be numeric. It cannot be less than 200. Minimum number of digits should be 3. ')

print(doc._.coref_clusters)

print(doc._.coref_resolved)

The output of the above code is:
[Phone area code: [Phone area code, It, It, It]]

Phone area code will be valid only when all the below conditions are met. Phone area code cannot be left blank. Phone area code should be numeric. Phone area code cannot be less than 200. Minimum number of digits should be 3.

For this you will need spaCy along with one of the English coreference models: en_coref_sm, en_coref_md or en_coref_lg. See the following link for a fuller explanation:

https://github.com/huggingface/neuralcoref

0
ezChx On
from stanfordnlp.server import CoreNLPClient
from nltk import tokenize

client = CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'parse', 'coref'], memory='4G', endpoint='http://localhost:9001')

def pronoun_resolution(text):

    ann = client.annotate(text)
    modified_text = tokenize.sent_tokenize(text)

    for coref in ann.corefChain:

        antecedent = []
        for mention in coref.mention:
            phrase = []
            for i in range(mention.beginIndex, mention.endIndex):
                phrase.append(ann.sentence[mention.sentenceIndex].token[i].word)
            if antecedent == []:
                antecedent = ' '.join(phrase)
            else:
                anaphor = ' '.join(phrase)
                modified_text[mention.sentenceIndex] = modified_text[mention.sentenceIndex].replace(anaphor, antecedent)

    modified_text = ' '.join(modified_text)

    return modified_text

text = 'Tom is a smart boy. He knows a lot of things.'
pronoun_resolution(text)

Output: 'Tom is a smart boy. Tom knows a lot of things.'
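One thing to watch with the str.replace call above is that it also matches inside longer words (for instance "He" inside "Helen"). Here is a minimal sketch of a whole-word replacement, assuming a regex word boundary is an acceptable notion of "whole word". replace_whole_word is a hypothetical helper, not part of stanfordnlp.

```python
import re


def replace_whole_word(sentence, anaphor, antecedent):
    """Replace `anaphor` only where it occurs as a whole word or phrase in `sentence`."""
    # re.escape keeps any punctuation in the anaphor literal; \b anchors on word boundaries.
    pattern = r'\b' + re.escape(anaphor) + r'\b'
    # A function replacement keeps backslashes in `antecedent` from being read as escapes.
    return re.sub(pattern, lambda _match: antecedent, sentence)
```

It could stand in for the .replace(anaphor, antecedent) call: replace_whole_word('He met Helen.', 'He', 'Tom') gives 'Tom met Helen.', whereas 'He met Helen.'.replace('He', 'Tom') gives 'Tom met Tomlen.'. It still replaces every whole-word occurrence in the sentence rather than only the token at the mention's index.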