I have a dataframe (more than 1 million rows) with an open-text column where customers can write whatever they want. Misspelled words appear frequently, and I'm trying to group comments that are essentially the same.
For example:
ID | Comment |
---|---|
1 | I want to change my credit card |
2 | I wannt change my creditt card |
3 | I want change credit caurd |
I have tried using Levenshtein distance, but it is computationally very expensive. Can you suggest another way to do this task?
Thanks!
The standard dynamic-programming algorithm for Levenshtein distance runs in O(N^2) time per comparison, where N is the length of the strings being compared.
If you define a maximum distance you're interested in, say m, you can reduce the time complexity to O(N·m) by only filling the diagonal band of the DP matrix (see the sketch below). In your context, the maximum distance is the maximum number of typos you accept while still considering two comments identical.
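Here is a minimal sketch of that idea in pure Python. The name `bounded_levenshtein` and its parameters are placeholders of mine, not from any library: only the cells of the DP matrix within `max_dist` of the diagonal are filled, and the function gives up as soon as no cell in a row can stay within the cutoff.

```python
def bounded_levenshtein(a, b, max_dist):
    """Levenshtein distance between a and b, or None if it exceeds max_dist."""
    n, m = len(a), len(b)
    if abs(n - m) > max_dist:
        return None                     # distance is at least |n - m|
    cap = max_dist + 1                  # stand-in for "already too far"
    # prev[j] = distance between the first i-1 characters of a and the first j of b
    prev = [j if j <= max_dist else cap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [cap] * (m + 1)
        curr[0] = i if i <= max_dist else cap
        lo = max(1, i - max_dist)
        hi = min(m, i + max_dist)
        for j in range(lo, hi + 1):     # only the diagonal band
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution or match
        if min(curr) > max_dist:
            return None                 # every cell already exceeds the cutoff
        prev = curr
    return prev[m] if prev[m] <= max_dist else None


print(bounded_levenshtein("creditt caurd", "credit card", 3))   # 2
print(bounded_levenshtein("credit card", "debit card", 1))      # None (distance is 3)
```

Row i only depends on row i-1, so two lists are enough; the band is what keeps each comparison at O(N·m) instead of O(N^2).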
If you cannot do that, you may try to parallelize the task: each pairwise comparison is independent of the others, so the work splits cleanly across processes (sketch below).
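As a rough illustration of the parallel route, here is a sketch using only the standard library. It assumes `bounded_levenshtein` from the previous snippet is defined at module level in the same file; the `best_match` helper and the list of reference phrases are hypothetical placeholders for whatever grouping scheme you end up with.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def best_match(comment, references, max_dist=3):
    """Return the reference phrase closest to comment within max_dist, or None."""
    best, best_d = None, max_dist + 1
    for ref in references:
        d = bounded_levenshtein(comment.lower(), ref.lower(), max_dist)
        if d is not None and d < best_d:
            best, best_d = ref, d
    return best

if __name__ == "__main__":
    comments = ["I want to change my credit card",
                "I wannt change my creditt card",
                "I want change credit caurd"]
    references = ["i want to change my credit card"]   # hypothetical canonical phrases
    with ProcessPoolExecutor() as pool:
        # chunksize matters once you feed in the full million-row column;
        # comments whose best distance exceeds max_dist come back as None
        matches = list(pool.map(partial(best_match, references=references),
                                comments, chunksize=1000))
    print(matches)
```

With a dataframe you would pass the comment column (e.g. `df["Comment"]`) instead of the small list above and assign the result back as a new column.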