RestTemplate unable to parse JSON REST API response properly


I am trying to extract named entities from German text using spaCy's NER. I have exposed the service as a REST POST endpoint that takes source text as input and returns a dictionary (map) of lists of named entities (persons, locations, organizations). The service is exposed using Flask-RESTPlus and hosted on a Linux server.

For a sample text, I get the following response from a POST request to the REST API via the Swagger UI:

{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian Krause",
    "Alex Schröder"
  ]
}

When I use Spring's RestTemplate to send a POST request to the API hosted on the Linux server from a Spring Boot application (running on Windows in Eclipse), the JSON is parsed correctly. I have added the following line to use UTF-8 encoding:

restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(Charset.forName("UTF-8")));

But when I deploy this Spring Boot application on a Linux machine and send a POST request to the API for NER tagging, the ner_persons are not parsed correctly. While debugging remotely, I get the following response:

{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian ",
    "Krause",
    "Alex ",
    "Schröder"
  ]
}

I cannot understand why this strange behavior occurs for persons but not for organizations.


1 Answer

Ravi

Being new to Python, it took me two days of debugging to understand the real problem and find a workaround.

The cause was that the names (e.g., "Sebastian Krause") were separated by \xa0, i.e., a non-breaking space character (e.g., "Sebastian\xa0Krause"), instead of a regular space. As a result, spaCy failed to detect them as a single named entity.
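Since a non-breaking space looks identical to a regular space when printed, it can help to inspect the text before passing it to the NER step. The following is a minimal sketch of one way to do that; the helper name find_unusual_whitespace and the sample string are hypothetical and not part of the original service.

```python
import unicodedata


def find_unusual_whitespace(text):
    """Return (index, code point, Unicode name) for every whitespace
    character that is not plain ASCII whitespace."""
    ascii_ws = " \t\n\r"
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch not in ascii_ws
    ]


# Hypothetical sample: the names are joined by \xa0 instead of a space.
sample = "Sebastian\xa0Krause und Alex\xa0Schr\u00f6der"

# repr() shows the escape sequence that print() would hide.
print(repr(sample))

hits = find_unusual_whitespace(sample)
for index, code_point, name in hits:
    print(index, code_point, name)
```

Running a check like this on the text sent by the deployed application would show whether the input itself, rather than the JSON parsing, differs between the two environments.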

Browsing through SO, I found the following solution:

import unicodedata
# NFKD applies compatibility decomposition, so \xa0 becomes a regular space
norm_text = unicodedata.normalize("NFKD", source_text)

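This also normalizes other Unicode compatibility characters, such as \u2026 (horizontal ellipsis), which becomes three periods. Characters without a compatibility decomposition, such as \u2013 (en dash), are left unchanged.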
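One caveat worth checking: NFKD is a decomposing form, so it also splits precomposed letters such as "ö" into a base letter plus a combining mark, which matters for German text. The sketch below compares NFKD with NFKC (which recomposes afterwards) and with a targeted replacement of \xa0 only. The sample string is made up for illustration, and whether the decomposed form affects the downstream NER results is an assumption that would need to be tested.

```python
import unicodedata

# Hypothetical input containing a no-break space, a precomposed "ö",
# an en dash (\u2013) and a horizontal ellipsis (\u2026).
source_text = "Sebastian\xa0Krause, Alex\xa0Schr\u00f6der \u2013 Miele\u2026"

nfkd = unicodedata.normalize("NFKD", source_text)
nfkc = unicodedata.normalize("NFKC", source_text)

# Narrowest option: only swap the no-break space, touch nothing else.
replaced = source_text.replace("\xa0", " ")

# Escape non-ASCII characters so the differences are visible.
for label, value in (("NFKD", nfkd), ("NFKC", nfkc), ("replace", replaced)):
    print(f"{label:8}", value.encode("unicode_escape").decode("ascii"))

# NFKD leaves "o" + combining diaeresis; NFKC keeps the single "ö".
print(len("Schr\u00f6der"), len(unicodedata.normalize("NFKD", "Schr\u00f6der")))
```

If the returned entity strings need to match the original text exactly, NFKC or the targeted replace may be the safer choice, since both keep the umlauts as single characters.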