Entity Attribute Extraction On Unstructured Medical Text

I am working on named entities and their attribute extraction. My objective is to extract the attributes associated with a particular entity in a sentence.

For example - "The Patient report is Positive for ABC disease"

In the above sentence, ABC is an entity and Positive is an attribute describing ABC.

I am looking for a concise approach to extract the attributes. I have already built a solution for extracting entities, which works well with respectable accuracy, and I am now working on the second part of the problem: extracting the attributes associated with each entity.

I tried extracting attributes with a rule-based approach, which gives decent results but has the following drawbacks:

  • The source code is unmanageable.
  • It is not at all generic, and new scenarios are difficult to handle.
  • It is time-consuming.

To find a more generic solution, I explored different NLP techniques and found dependency tree parsing to be a potential solution.

I am looking for suggestions/inputs on how to solve this problem with dependency tree parsing in Python or Java.

Feel free to suggest any other technique which could potentially help here.

There is 1 answer

David Dale

I suggest using the spaCy Python library, because it is easy to use and has a decent dependency parser.

A baseline solution would traverse the dependency tree in a breadth-first fashion starting from your entity of interest, until it encounters a token that looks like an attribute or until it walks too far from the entity.

Further improvements to this solution would include:

  • Some rules for handling negations such as "not positive"
  • A better classifier for attributes (here I just look for adjectives)
  • Some rules about what types of dependency and what tokens should be taken into account

Here is my baseline code:

from collections import deque

import spacy

nlp = spacy.load("en_core_web_sm")
text = "The Patient report is Positive for ABC disease"
doc = nlp(text)
# Map surface form to token (a repeated word keeps only its last occurrence)
tokens = {token.text: token for token in doc}

def is_attribute(token):
    # todo: use a classifier to determine whether the token is an attribute
    return token.pos_ == 'ADJ'

def bfs(token, predicate, max_distance=3):
    # Breadth-first search over the dependency tree, treated as an undirected graph
    queue = deque([(token, 0)])
    visited = {token.i}
    while queue:
        t, dist = queue.popleft()
        if max_distance and dist > max_distance:
            return None
        if predicate(t):
            return t
        # todo: maybe, consider only specific types of dependencies or tokens
        # The root token is its own head in spaCy; the visited set skips such repeats
        for n in [t.head] + list(t.children):
            if n.i not in visited:
                visited.add(n.i)
                queue.append((n, dist + 1))
    return None

print(bfs(tokens['ABC'], is_attribute))  # Positive
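
Two of the improvements listed above, negation rules and filtering by dependency type, could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested extension of the baseline: to keep it runnable without a parser model, it uses a hand-built toy tree instead of a real spaCy parse. The names `Tok`, `attach`, `build_example`, `IGNORED_DEPS`, `find_attribute`, `is_negated` and `extract` are all hypothetical, and the sketch assumes that the parser marks negation with a `neg` dependency label attached either to the attribute or to its head.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a parsed token (text, POS tag, dependency label)."""
    text: str
    pos: str
    dep: str
    head: Optional["Tok"] = field(default=None, repr=False)
    children: List["Tok"] = field(default_factory=list, repr=False)


def attach(child: Tok, head: Tok) -> None:
    """Link a child token to its head in the toy tree."""
    child.head = head
    head.children.append(child)


def build_example(negated: bool) -> Tok:
    """Hand-built tree for 'The report is (not) positive for ABC disease'.

    Returns the entity token 'ABC'. The labels are assumed, not parser output.
    """
    root = Tok("is", "AUX", "ROOT")
    report = Tok("report", "NOUN", "nsubj")
    positive = Tok("positive", "ADJ", "acomp")
    prep = Tok("for", "ADP", "prep")
    disease = Tok("disease", "NOUN", "pobj")
    abc = Tok("ABC", "PROPN", "compound")
    attach(report, root)
    attach(Tok("The", "DET", "det"), report)
    if negated:
        attach(Tok("not", "PART", "neg"), root)
    attach(positive, root)
    attach(prep, positive)
    attach(disease, prep)
    attach(abc, disease)
    return abc


# Assumption: tokens with these labels are unlikely to lead to an attribute
IGNORED_DEPS = {"det", "punct"}


def find_attribute(entity: Tok, max_distance: int = 3) -> Optional[Tok]:
    """Breadth-first search from the entity, skipping ignored dependency types."""
    queue = deque([(entity, 0)])
    seen = {id(entity)}
    while queue:
        tok, dist = queue.popleft()
        if dist > max_distance:
            return None
        if tok.pos == "ADJ":
            return tok
        neighbors = ([tok.head] if tok.head is not None else []) + tok.children
        for n in neighbors:
            if id(n) not in seen and n.dep not in IGNORED_DEPS:
                seen.add(id(n))
                queue.append((n, dist + 1))
    return None


def is_negated(attr: Tok) -> bool:
    """True if a 'neg' token hangs off the attribute or off the attribute's head."""
    candidates = list(attr.children)
    if attr.head is not None:
        candidates += attr.head.children
    return any(c.dep == "neg" for c in candidates)


def extract(entity: Tok) -> Optional[Tuple[str, bool]]:
    """Return (attribute text, negated flag) for the entity, or None."""
    attr = find_attribute(entity)
    return None if attr is None else (attr.text, is_negated(attr))


print(extract(build_example(negated=True)))   # ('positive', True)
print(extract(build_example(negated=False)))  # ('positive', False)
```

With real spaCy tokens, the same checks would read `token.dep_` and `token.pos_` instead of the toy fields. Which dependency labels to ignore, and where negation attaches, depends on the parser and the domain, so both are worth checking against actual parses of your medical text.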