Extract relevant attributes from postal address data in order to run PCA on it (using R)

I have a big file that contains string data: postal addresses. Example address: "1780 wemmel rue hendrik de mol 59/7"

I need to run a PCA on that data so that the clusters on the individuals graph correspond to the physical delivery posts (buildings, companies, ...). To do that, I need to extract relevant information (numeric or not) from the strings and use it as my attributes; then I can analyze them with PCA.

I started by creating 36 attributes (A-Z and 0-9) that count the occurrences of each letter and digit in an address. But the PCA doesn't give a good result yet, so I need to extract more attributes that characterize the data.

I'd like your ideas about what else I can extract from the data to get a good representation of the clusters on the individuals graph. I'm using R.

Thank you.

There is 1 answer

Best answer, by cyberj0g

I don't think this is a task for PCA. I would first try to introduce some kind of distance measure between two addresses. You can either use the entire address as a single feature, in which case there are plenty of general-purpose string similarity measures, for example Levenshtein distance (R's utils package has a function for it, adist). Or you can introduce more features, such as building number, postal code, etc., and use a combination of Euclidean distance and a text-similarity distance. Your 36 variables seem like too many for the task. Either way, your distance measure should give a small value for 'close' addresses and a large value for unrelated addresses in your domain.
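
As a rough sketch of what such a combined distance could look like: the snippet below uses adist from utils for the text part and Euclidean distance on two extracted numbers for the numeric part. The address layout (postal code first, building number last, optional "/box" suffix), the second and third sample addresses, and the equal weights are all assumptions made for illustration, not something taken from the real data.

```r
# Toy data: the first address is the one from the question, the others are made up.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59/8",
  "1000 bruxelles avenue louise 120"
)

# Text part: pairwise generalized Levenshtein distance (utils::adist).
text_dist <- adist(addresses)

# Numeric part: hypothetical features, assuming the layout
# "<postal code> ... <building number>[/<box>]".
postal_code <- as.numeric(sub("^([0-9]+).*$", "\\1", addresses))
building_no <- as.numeric(sub("^.* ([0-9]+)(/[0-9]+)?$", "\\1", addresses))
num_dist <- as.matrix(dist(cbind(postal_code, building_no)))  # Euclidean

# Rescale both parts to [0, 1] so neither dominates, then mix them.
# The 0.5 / 0.5 weights are arbitrary and would need tuning.
w_text <- 0.5
w_num <- 0.5
combined <- w_text * text_dist / max(text_dist) +
  w_num * num_dist / max(num_dist)

print(round(combined, 3))
```

With this toy data the two "hendrik de mol" addresses end up much closer to each other than either is to the "avenue louise" one, which is the behaviour you want from the measure.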

After deciding on a distance measure and choosing features, apply k-means clustering with a custom distance function to your data. You can use the flexclust package for that. Nice suggestions for determining the number of clusters can be found here.
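
To illustrate clustering on a custom distance, here is a minimal sketch. Note that it swaps in k-medoids (pam from the cluster package) for the k-means/flexclust route, simply because pam accepts a precomputed dissimilarity matrix directly, which is convenient when the distance is string-based. The sample addresses (apart from the first) and k = 2 are assumptions for the example; on real data the number of clusters has to be chosen separately.

```r
library(cluster)  # recommended package, normally shipped with R

# Toy data: the first address is from the question, the rest are made up.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59/8",
  "1780 wemmel rue hendrik de mol 61",
  "1000 bruxelles avenue louise 120",
  "1000 bruxelles avenue louise 122"
)

# Custom distance: plain Levenshtein distance here; any symmetric
# dissimilarity matrix (e.g. a text + numeric mix) could be used instead.
d <- as.dist(adist(addresses))

# k-medoids on the precomputed dissimilarities; k = 2 is assumed known.
fit <- pam(d, k = 2, diss = TRUE)
clusters <- fit$clustering

# Show which addresses were grouped together.
print(split(addresses, clusters))
```

Each medoid is an actual address from the data, so the cluster "centers" stay interpretable, which is not the case for a mean of strings.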

With that you'll likely find your clusters. Good luck.