Imagine you get emails like these:
name1: value
name2: value
name3: value
...
These values should be inserted into a database whose column names match the names in the email.
However, the emails might contain errors, for example a typo or an abbreviation instead of a full name. The writer might also change a name arbitrarily, for example bike into bicycle.
These emails should be processed automatically, even if they contain errors. The processing script should be able to "fix" the errors.
I thought a text-classifying (convolutional) neural network might do the job, but it seems like overkill. Is there a better or easier solution?
Here are some thoughts, since you know the keys (column names) in advance. Let's assume there are `color` and `density`.
`cloor` could be matched to `color`, since the edit distance is 1 if swapping two adjacent letters counts as a single edit (as in Damerau-Levenshtein distance; plain Levenshtein distance gives 2). However, if there are several matches with a low enough edit distance, you'll probably want to play it safe and not map the data.
If a key is `dens`, and there's only one column (`density`) that starts with `dens`, you can probably safely assume it's `density`.
For all unmapped keys I'd add a "stash" column to the database that you can put the unrecognized data in (in, say, JSON format), and have the script alert the operator (you!) about unrecognized keys. That way you can improve the logic, and then use the improved logic to map data from the stash column to the real columns.
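The ideas above could be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: the column list, the function names (`edit_distance`, `resolve_key`, `process_email`), the thresholds `max_distance=1` and `min_prefix=3`, and the `stash` key are all assumptions made for the example. The distance used is the "optimal string alignment" variant, which counts a swap of two adjacent letters as one edit.

```python
import json

COLUMNS = ["color", "density"]  # hypothetical known column names


def edit_distance(a, b):
    """Edit distance where insert, delete, substitute, and swapping
    two adjacent characters each cost 1 (optimal string alignment)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "lo" <-> "ol".
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def resolve_key(key, columns=COLUMNS, max_distance=1, min_prefix=3):
    """Return the matching column name, or None if missing or ambiguous."""
    key = key.strip().lower()
    if key in columns:
        return key
    # Abbreviation: accept only if exactly one column starts with the key.
    by_prefix = [c for c in columns if c.startswith(key)]
    if len(key) >= min_prefix and len(by_prefix) == 1:
        return by_prefix[0]
    # Typo: accept only if exactly one column is close enough.
    close = [c for c in columns if edit_distance(key, c) <= max_distance]
    return close[0] if len(close) == 1 else None


def process_email(body, columns=COLUMNS):
    """Turn 'name: value' lines into a row dict; unmapped keys go to 'stash'."""
    row, stash = {}, {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        column = resolve_key(name, columns)
        if column is None or column in row:
            stash[name.strip()] = value.strip()  # keep for manual review
        else:
            row[column] = value.strip()
    # Store unrecognized data as JSON; a non-empty stash is the signal
    # to alert the operator.
    row["stash"] = json.dumps(stash) if stash else None
    return row
```

For example, `process_email("cloor: red\ndens: 7.8\nwieght: 12")` maps the first two lines to `color` and `density` and leaves `wieght` in the stash. Returning `None` on ambiguity is the "play it safe" choice: a value that lands in the stash can be remapped later, while a value written to the wrong column is much harder to notice.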