I am trying to extract named entities from German text using spaCy's NER. I have exposed the service as a REST POST endpoint that takes the source text as input and returns a dictionary (map) of lists of named entities (persons, locations, organizations). The service is exposed using Flask-RESTPlus and hosted on a Linux server.
For a sample text, I get the following response when I send a POST request to the REST API through Swagger UI:
{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian Krause",
    "Alex Schröder"
  ]
}
When I use Spring's RestTemplate to send the POST request to the API on the Linux server from a Spring Boot application (running in Eclipse on Windows), the JSON is parsed correctly. I added the following line to use UTF-8 encoding:
restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(Charset.forName("UTF-8")));
But when I deploy this Spring Boot application on a Linux machine and send the POST request to the API for NER tagging, the ner_persons entries are not parsed correctly. While remote debugging, I get the following response:
{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian ",
    "Krause",
    "Alex ",
    "Schröder"
  ]
}
I cannot understand why this strange behavior occurs for persons but not for organizations.
Being new to Python, it took me two days of debugging to understand the real problem and find a workaround.
The reason was that the parts of each name (e.g., "Sebastian Krause") were separated by \xa0, i.e., a non-breaking space character (e.g., "Sebastian\xa0Krause"), instead of a regular space. So spaCy was failing to detect them as a single named entity.
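A quick way to confirm this kind of problem is to look at the raw code points in the input. The following is only a minimal sketch using the standard library; describe_whitespace is a hypothetical helper name, not part of spaCy or of my service:

```python
import unicodedata

def describe_whitespace(text):
    """List every whitespace character that is not a plain ASCII space.

    Hypothetical helper for debugging; returns (index, code point, Unicode name).
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch != " "
    ]

sample = "Sebastian\xa0Krause"
print(repr(sample))                  # repr() shows the \xa0 escape instead of rendering it as a blank
print(describe_whitespace(sample))   # reports the non-breaking space and its position
```

Printing the text normally hides the difference, because a non-breaking space renders like an ordinary space.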
Browsing through SO, I found the following solution from here:
This also normalizes other Unicode characters such as \u2013, \u2026, etc.
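For illustration, here is a minimal sketch of this kind of normalization applied to the text before it is passed to spaCy. It assumes the standard-library unicodedata module and is not necessarily the same code as the linked solution; normalize_text is a hypothetical helper name. NFKC maps \xa0 to a regular space and \u2026 to three dots, but it leaves \u2013 (en dash) unchanged, so the sketch replaces that one explicitly:

```python
import unicodedata

def normalize_text(text):
    """Normalize compatibility characters before NER (hypothetical helper)."""
    # NFKC: non-breaking space -> regular space, ellipsis -> "...";
    # composed letters such as "ö" are kept as single characters.
    text = unicodedata.normalize("NFKC", text)
    # The en dash has no compatibility mapping, so replace it explicitly.
    return text.replace("\u2013", "-")

cleaned = normalize_text("Sebastian\xa0Krause")
print(repr(cleaned))
```

The cleaned string can then be handed to the spaCy pipeline in place of the raw input.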