Python recognize text in email


Imagine you get emails like these:

name1: value
name2: value
name3: value

...

These values should be inserted into a database with column names equal to names in the email.

However, the emails might contain errors, for example a typo, or an abbreviation instead of a full name. The writer might also arbitrarily change a name, for example from bike to bicycle.

These emails should be processed automatically, even if they contain errors. The processing script should be able to "fix" the errors.

I thought a text-classifying (convolutional) neural network might do the job, but it seems like overkill. Is there a better or easier solution?


There is 1 answer

AKX On

Here are some thoughts, since you know the keys (column names) in advance. Let's assume there are two columns, color and density.

  • You could use an edit distance (Levenshtein distance, for instance) to match any unrecognized key to the closest actual one, if it's close enough. Say, cloor could be matched to color, since the two differ only by one swapped pair of letters (a distance of 1 if a transposition counts as a single edit, as in Damerau-Levenshtein; 2 under plain Levenshtein). However, if several columns match with a low enough edit distance, you'll probably want to play it safe and not map the data.
  • Similarly, for abbreviations you could elect to map them by unique prefix, i.e. if someone uses dens, and there's only one column (density) that starts with dens, you can probably safely assume it's density.
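
Both ideas can be sketched in a few lines of Python. This is only an illustration under assumptions: the column list, the thresholds, and the names osa_distance, match_column, COLUMNS, max_distance and min_prefix_len are made up here. The distance counts a swap of two adjacent letters as one edit (the "optimal string alignment" variant of Damerau-Levenshtein), which is a choice of this sketch and not something the approach requires.

```python
from typing import Optional, Sequence

COLUMNS = ["color", "density"]  # hypothetical column names


def osa_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute and an
    adjacent-letter swap each cost 1 (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent letters swapped, e.g. "lo" vs "ol".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def match_column(
    key: str,
    columns: Sequence[str] = COLUMNS,
    max_distance: int = 1,    # assumed threshold, tune for your data
    min_prefix_len: int = 3,  # assumed: ignore very short abbreviations
) -> Optional[str]:
    """Return the column a key most likely refers to, or None if unsure."""
    key = key.strip().lower()
    if key in columns:
        return key
    # Abbreviation: accept only if exactly one column starts with the key.
    if len(key) >= min_prefix_len:
        prefixed = [c for c in columns if c.startswith(key)]
        if len(prefixed) == 1:
            return prefixed[0]
        if len(prefixed) > 1:
            return None  # ambiguous prefix, play it safe
    # Typo: accept only if exactly one column is within the threshold.
    close = [c for c in columns if osa_distance(key, c) <= max_distance]
    return close[0] if len(close) == 1 else None
```

With these assumptions, match_column("cloor") gives "color", match_column("dens") gives "density", and an unrelated key such as "weight" gives None. A real rename like bike to bicycle happens to be caught by the prefix rule only because one word starts with the other; most synonyms would need an explicit alias table. The standard library's difflib.get_close_matches is another option for the typo case if a similarity ratio suits you better than an edit count.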

For all unmapped columns I'd add a "stash" column to the database where you can put the unrecognized data (in, say, JSON format), and have the script alert the operator (you!) about unrecognized keys. That way you can improve the logic, and later use the improved logic to map data from the stash column to the real columns.
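
A rough sketch of the stash idea, again with made-up names: build_row and the column name stash are hypothetical, and resolve stands for whatever key-matching function you end up with (it should return a column name, or None when unsure). The actual database insert and the operator alert are left out.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


def build_row(
    pairs: Dict[str, str],
    resolve: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Split parsed email pairs into mapped columns and a JSON stash.

    Returns the row to insert and the list of keys that could not be
    mapped, so the caller can alert the operator about them.
    """
    row: Dict[str, Optional[str]] = {}
    stash: Dict[str, str] = {}
    for key, value in pairs.items():
        column = resolve(key)
        # Unknown key, or a second key landing on an already filled
        # column: keep the raw data instead of guessing or overwriting.
        if column is None or column in row:
            stash[key] = value
        else:
            row[column] = value
    # Hypothetical "stash" column holding the leftovers as JSON text.
    row["stash"] = json.dumps(stash) if stash else None
    return row, sorted(stash)


# Usage with a trivial lookup table standing in for the real matcher.
aliases = {"color": "color", "dens": "density"}
row, unknown = build_row(
    {"color": "red", "dens": "1.2", "wheels": "2"},
    aliases.get,
)
if unknown:
    print("Unrecognized keys, please review:", ", ".join(unknown))
```

Because the stash keeps the original key next to the value, rows can be reprocessed later by loading the JSON and running the improved matcher over it again.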