I am trying to normalize data and populate the correct zipcode, city, and state. The data contains zipcode, city, state, and address fields, along with lots of wrong information such as typos. I have tried the following approaches:
Look up the correct zipcode, city, and state information and normalize against it (a simplified sketch of this step is shown below the two approaches). This covers only 40-50% of the records correctly.
Tokenize the address and apply lots of conditional statements to derive the correct zipcode, city, and state, combined with the lookup information. The address field contains a lot of rich information that is useful for building the lookup and for normalization. This approach covers only 50-60% of the records correctly.
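Roughly, the lookup step looks like this (a simplified sketch; the reference table and field names here are made up for illustration, in practice the table comes from a trusted zipcode/city/state master list):

```python
# Simplified sketch of the zipcode lookup normalization.
# The reference dictionary below is illustrative only.
reference = {
    "10001": ("New York", "NY"),
    "94105": ("San Francisco", "CA"),
}

def normalize_record(record):
    """Overwrite city/state when the zipcode is found in the reference table."""
    zipcode = record.get("zipcode", "").strip()
    if zipcode in reference:
        city, state = reference[zipcode]
        record["city"] = city
        record["state"] = state
        record["normalized"] = True
    else:
        record["normalized"] = False  # falls through to the token/rule step
    return record

record = {"zipcode": "10001", "city": "New Yrok", "state": "NY", "address": "350 5th Ave"}
print(normalize_record(record))
```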
The data contains a lot of historical information, and new data keeps coming in, so normalization is an iterative process. Is there a better way to do data normalization using a machine learning technique, i.e. one that learns from the historical data and does the normalization itself?
This is quite a general question, so I will give a general answer.
Machine learning should be used only if nothing simpler helps. The simplest solution would be: if you have enough data (you can afford to discard some of it) and the incoming data stays of roughly the same quality, try filtering and validating with some regular expressions - it is fast and straightforward.
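For example, a minimal sketch of regex-based zipcode cleaning (assuming US-style 5-digit or ZIP+4 codes; adjust the pattern to whatever format your data actually uses):

```python
import re

# Matches 5-digit US zipcodes, optionally with a +4 extension, e.g. "12345" or "12345-6789".
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def clean_zipcode(raw):
    """Return the normalized zipcode, or None if it cannot be salvaged."""
    candidate = raw.strip()
    if ZIP_PATTERN.match(candidate):
        return candidate
    # Common fixable case: digits buried in noise, e.g. "NY 10001".
    digits = re.search(r"\d{5}", candidate)
    return digits.group(0) if digits else None

print(clean_zipcode(" 10001 "))   # "10001"
print(clean_zipcode("NY 10001"))  # "10001"
print(clean_zipcode("abcde"))     # None
```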
With machine learning, you will lose some time on training, and the accuracy is not guaranteed. But of course, there are cases where ML can help a lot.
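If you do go that route, one lightweight option (only a sketch, not a full solution; it assumes scikit-learn and a list of canonical city names built from your already-corrected historical records) is to match misspelled city names against the canonical list using character n-gram features and nearest-neighbour search:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Canonical city names, e.g. extracted from historical records that were already corrected.
canonical_cities = ["New York", "San Francisco", "Los Angeles", "Chicago"]

# Character n-grams are robust to typos such as "New Yrok" or "Chicgo".
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
city_vectors = vectorizer.fit_transform(canonical_cities)

index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(city_vectors)

def suggest_city(raw_city):
    """Return the closest canonical city and its cosine distance (lower is better)."""
    distances, indices = index.kneighbors(vectorizer.transform([raw_city]))
    return canonical_cities[indices[0][0]], distances[0][0]

print(suggest_city("New Yrok"))  # ('New York', small distance)
print(suggest_city("Chicgo"))    # ('Chicago', small distance)
```

A distance threshold lets you accept the suggestion automatically when it is close and fall back to your rule-based step or manual review when it is not.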