How to get Dependency Tree in JSON format in SyntaxNet?


I am trying to get a dependency tree in JSON format from SyntaxNet, but all the examples give me is a Sentence object, which provides no accessors for reading the parsed structure or even iterating over its items.

When I run the examples from the docker file provided by TensorFlow/SyntaxNet, the output is as follows:

text: "Alex saw Bob"
token {
  word: "Alex"
  start: 0
  end: 3
  head: 1
  tag: "attribute { name: \"Number\" value: \"Sing\" } attribute { name: \"fPOS\" value: \"PROPN++NNP\" } "
  category: ""
  label: "nsubj"
  break_level: NO_BREAK
}
token {
  word: "saw"
  start: 5
  end: 7
  tag: "attribute { name: \"Mood\" value: \"Ind\" } attribute { name: \"Tense\" value: \"Past\" } attribute { name: \"VerbForm\" value: \"Fin\" } attribute { name: \"fPOS\" value: \"VERB++VBD\" } "
  category: ""
  label: "root"
  break_level: SPACE_BREAK
}
token {
  word: "Bob"
  start: 9
  end: 11
  head: 1
  tag: "attribute { name: \"Number\" value: \"Sing\" } attribute { name: \"fPOS\" value: \"PROPN++NNP\" } "
  category: ""
  label: "parataxis"
  break_level: SPACE_BREAK
}

The class of this object is syntaxnet.sentence_pb2.Sentence, which itself has no documentation.

I need to be able to access the above output programmatically.

As seen in this question, that approach prints a table as a string and does not give me a programmatic response.

How can I get the response as data rather than as printed output? Or should I write a parser for this output?

1 Answer

Answered by Ido.Schwartzman (accepted answer)

TL;DR Code at the end...

The Sentence object is an instance of the sentence_pb2.Sentence class, which is generated from protobuf definition files, specifically sentence.proto. This means that if you look at sentence.proto, you will see the fields that are defined for that class and their types: a string field called "tag", a string field called "label", an integer field called "head", and so on. In theory you could convert it to JSON with Python's built-in facilities, but since protobuf classes are runtime-generated metaclasses, that may produce some undesired results.
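For reference, the relevant messages in sentence.proto look roughly like this (paraphrased from memory of the SyntaxNet repository; field numbers, defaults, and field rules may differ slightly):

```proto
message Token {
  optional string word = 1;      // the token text
  optional int32 start = 2;      // offset of the first character
  optional int32 end = 3;        // offset of the last character
  optional int32 head = 4;       // index of the head token
  optional string tag = 5;       // fine-grained POS / attribute string
  optional string category = 6;  // coarse category
  optional string label = 7;     // dependency relation
}

message Sentence {
  optional string text = 1;
  repeated Token token = 2;
}
```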

So what I did was first create a dict with all the info I wanted, then convert that to JSON:

import logging
import re

# Matches the name/value pairs inside a token's "tag" string, e.g.
#   attribute { name: "Number" value: "Sing" }
attribute_expression = re.compile(r'name: "([^"]+)" value: "([^"]+)"')

def parse_attributes(attributes):
    matches = attribute_expression.findall(attributes)
    return {k: v for k, v in matches}

def token_to_dict(token):
    def extract_pos(fpos):
        # fPOS packs the universal and language-specific tags as "UPOS++XPOS".
        i = fpos.find("++")
        if i == -1:
            return fpos, "<error>"
        else:
            return fpos[:i], fpos[i + 2:]

    attributes = parse_attributes(token.tag)
    if "fPOS" not in attributes:
        logging.warning("token %s has no fPOS attribute", token.word)
        logging.warning("attributes are: %s", attributes)
        fpos = ""
    else:
        fpos = attributes["fPOS"]

    upos, xpos = extract_pos(fpos)
    return {
        'word': token.word,
        'start': token.start,
        'end': token.end,
        'head': token.head,
        'features': attributes,
        'tag': token.tag,
        'deprel': token.label,
        'upos': upos,
        'xpos': xpos
    }

def sentence_to_dict(anno):
    return {
        'text': anno.text,
        'tokens': [token_to_dict(token) for token in anno.token]
    }

If you call sentence_to_dict on the sentence object, you'll get a plain dict which can then be serialized with json.dumps.
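As a quick check, here is how the pieces fit together on the tag string from the question's output. Note that the regex pattern is an assumption, since the answer above uses attribute_expression without showing its definition:

```python
import json
import re

# Assumed pattern for attribute_expression (not shown in the original answer):
# it captures the name/value pairs from strings such as
#   attribute { name: "Number" value: "Sing" }
attribute_expression = re.compile(r'name: "([^"]+)" value: "([^"]+)"')

# Tag string taken from the "Alex" token in the question's output.
tag = ('attribute { name: "Number" value: "Sing" } '
       'attribute { name: "fPOS" value: "PROPN++NNP" } ')

features = {k: v for k, v in attribute_expression.findall(tag)}
print(json.dumps(features, sort_keys=True))
# {"Number": "Sing", "fPOS": "PROPN++NNP"}

# fPOS packs universal and language-specific tags together as "UPOS++XPOS".
upos, xpos = features["fPOS"].split("++")
print(upos, xpos)
# PROPN NNP
```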