I have a database without about 700 records, each with their own zip code. I want to show the rows (by unique ID) that fall within a 100 mile cluster. I also have a zip code distance DB from the National Bureau of Economic Research (https://data.nber.org/distance/2022/100miles/). This file is quite large but looks like the table below.
zip1 miles_to_zip2 zip2
1534 50 1001
1534 44 1002
1534 48 1003
This data has 13m rows, but I really only need the distance for the zip codes in my database (about 700 different ones).
I can use Python, R, or Power BI to do this, depending on which is the simplest. How do I group all my unique IDs by a cluster group that identifies locations within 100 miles of each other? For reference, the data I want to analyze looks similar to the table below.
ID ZipCode Sales
1 15110 10,000
2 15115 15,000
3 98000 2,000
4 98001 10,000
5 70570 10,000
My goal is essentially to return a column called cluster that would assign ID 1 and 2 to cluster1 and ID 3 and 4 to cluster 2, and 5 to cluster 3 (based on the 100 mile criteria).