How to extract a non-English address from a string

It would be a big help if someone here knows of a Python library that can extract only the address from a whole message. The addresses are in Turkey, and the text is originally in Turkish.

A translated example would be:

"Hi, my name is Salem and I have information about a crash site at .....(Address goes here), Many Thanks."

I tried looking online and didn't find a library that handles Turkish addresses, only some NLP projects for US addresses. The input is plain text. I have already translated it to English, but I don't know how to extract just the address from the whole message.

There is 1 answer

Pawel Kam On

You are looking for personally identifiable information (PII) detection software.

There are plenty of open-source libraries in this field, although I don’t know which of them (if any) is suited to your use case. Another issue to consider is how much time you want to spend on configuration and writing additional software. It is worth checking them out first, because they are free to use under the terms of their licenses.

Next, you should consider paid software for PII detection. There are many such offerings. You should probably look for software that handles Turkish addresses, which may be too specific for some tools. I’m an AWS guy so I use Amazon Comprehend, but there are other solutions as well, for example Azure Cognitive Service for Language. Below is an example of how this can be done with the Amazon Comprehend detect_pii_entities API.

import boto3
from botocore.exceptions import ClientError

client = boto3.client('comprehend')
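# Adjacent string literals are concatenated; the original mixed quote styles
# and left stray quote characters inside the text.
text = ('Hi, my name is Salem and I have information '
        'about a crash site at Sultan Ahmet, Ayasofya Meydanı No:1, 34122 Fatih/İstanbul, Turkey, Many Thanks.')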

try:
    response = client.detect_pii_entities(Text=text, LanguageCode='en')
    entities = response['Entities']
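except ClientError as err:
    # The request itself failed (e.g. bad credentials or throttling);
    # this is not the same as "no PII found", which returns an empty list.
    print(f'Comprehend request failed: {err}')
    entities = []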

The API response is a JSON object that gives, for each detected entity, its type, a confidence score, and the character offsets where it starts and ends in the input text.

{
    "Entities": [
        {
            "Score": 0.9998736381530762,
            "Type": "NAME",
            "BeginOffset": 15,
            "EndOffset": 20
        },
        {
            "Score": 0.9996119737625122,
            "Type": "ADDRESS",
            "BeginOffset": 66,
            "EndOffset": 131
        }
    ]
}

You can then, for example, iterate over the entities and slice the addresses out of the original text using their offsets.

addresses = [e for e in entities if e['Type'] == 'ADDRESS']
for a in addresses:
    print(text[a['BeginOffset']:a['EndOffset']])
    # prints "Sultan Ahmet, Ayasofya Meydanı No:1, 34122 Fatih/İstanbul, Turkey"
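
To try the slicing step without calling AWS, here is a minimal offline sketch that reuses the sample text and response shown above. The helper name extract_spans and its min_score threshold are my own assumptions for illustration, not part of the Comprehend API; the threshold simply drops low-confidence entities.

```python
# Sample text and response mirror the example above, so no AWS call is made.
text = (
    "Hi, my name is Salem and I have information "
    "about a crash site at Sultan Ahmet, Ayasofya Meydanı No:1, "
    "34122 Fatih/İstanbul, Turkey, Many Thanks."
)
response = {
    "Entities": [
        {"Score": 0.9998736381530762, "Type": "NAME",
         "BeginOffset": 15, "EndOffset": 20},
        {"Score": 0.9996119737625122, "Type": "ADDRESS",
         "BeginOffset": 66, "EndOffset": 131},
    ]
}


def extract_spans(source, entities, entity_type, min_score=0.9):
    """Hypothetical helper: return the substrings of `source` for entities
    of the given type whose confidence score is at least `min_score`."""
    return [
        source[e["BeginOffset"]:e["EndOffset"]]
        for e in entities
        if e["Type"] == entity_type and e["Score"] >= min_score
    ]


print(extract_spans(text, response["Entities"], "ADDRESS"))
```

The same function works on the entities list returned by a real detect_pii_entities call, since it only relies on the Type, Score, BeginOffset and EndOffset fields shown in the response.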

Be aware that this particular tool is paid and you will have to authenticate before using it.