Extract relevant attributes from postal addresses data in order to do PCA on those Data (using R)


Asked by Taoufiq Mouhcine on 17 June 2015 at 09:51

I have a big file that contains string data: postal addresses. Example address: "1780 wemmel rue hendrik de mol 59/7"

I need to run a PCA on this data so that the clusters representing the physical delivery points (buildings, companies, ...) show up on the individuals plot. To do that, I need to extract relevant information (numeric or not) from the strings and use it as my attributes, which I can then analyze with PCA.

I started by creating 36 attributes (A-Z and 0-9) that count the occurrences of each letter and digit. But the PCA doesn't give a good result yet, so I need to extract more attributes that characterize the data.

I would like your ideas about what I can extract from the data to get a good representation of the clusters on the individuals plot. I'm using R.

Thank you.

There is 1 answer

**cyberj0g** · Accepted Answer · 2015-06-17T10:59:42+00:00

I don't think this is a task for PCA. I would first try to define some kind of distance measure between two addresses. You can either treat the entire address as a single feature, in which case there are plenty of general-purpose string similarity measures, for example Levenshtein distance (the adist function in the utils package computes it). Or you can introduce more features, such as building number, postal code, etc., and use a combination of Euclidean distance and a text-similarity distance. Your 36 variables seem like too many for the task. Either way, your distance measure should give a small value for 'close' addresses and a large value for unrelated addresses in your domain.
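To make the two options concrete, here is a minimal sketch in base R. It assumes that each address starts with a 4-digit postal code and ends with a building number (optionally followed by "/box"), as in the example from the question; the extra addresses, the regular expressions, and the 0.5/0.5 weights are illustrative assumptions, not something taken from the real data.

```r
# Hypothetical sample; only the first address comes from the question.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59/8",
  "1000 bruxelles avenue louise 12"
)

# Option 1: whole address as a single feature.
# adist() returns the pairwise Levenshtein (edit) distance matrix.
lev <- utils::adist(addresses)

# Option 2: extract a few numeric features (assumed address layout).
postal   <- as.numeric(sub("^([0-9]{4}).*$", "\\1", addresses))
building <- as.numeric(sub("^.* ([0-9]+)(/[0-9]+)?$", "\\1", addresses))

# Euclidean distance on the standardized numeric features.
euc <- as.matrix(dist(scale(cbind(postal, building))))

# Combine both parts after rescaling each to a comparable range.
# The equal weights are an arbitrary starting point to tune.
lev_norm <- lev / max(nchar(addresses))
combined <- 0.5 * lev_norm + 0.5 * euc / max(euc)
```

With real data, the weights and the extracted features would need tuning so that addresses belonging to the same delivery point end up with a small combined distance.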

After deciding on a distance measure and choosing features, apply k-means clustering with a custom distance function to your data. You can use the flexclust package for that. Good suggestions for determining the number of clusters can be found here.
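As a rough illustration of clustering on a custom distance, the sketch below uses base R only. Note the substitution: it runs hierarchical clustering (hclust) rather than k-means via flexclust, because hclust accepts a precomputed distance matrix directly, which is convenient when the distance is an edit distance between strings. The sample addresses and the choice of k = 2 are assumptions for the example.

```r
# Hypothetical sample; only the first address comes from the question.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59/8",
  "1000 bruxelles avenue louise 12",
  "1000 bruxelles avenue louise 14"
)

# Pairwise Levenshtein distances, converted to a "dist" object.
d <- as.dist(utils::adist(addresses))

# Average-linkage hierarchical clustering on the custom distance.
tree <- hclust(d, method = "average")

# Cut the tree into k groups (k = 2 is an assumption for this toy sample).
clusters <- cutree(tree, k = 2)

# List the addresses assigned to each cluster.
split(addresses, clusters)
```

Any other distance matrix, such as a weighted combination of text and numeric distances, can be passed through as.dist() in the same way.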

With that, you'll likely find your clusters. Good luck.
