K-Means Clustering a list of US addresses based on drive time

1.1k views Asked by At

I have 8 traveling consultants that need to visit 155 groups across the continental united states. Is there a way to find the optimal 8 regions based of drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based off drive time. How will I need to manipulate my data set to make it usable?

Thank you in advance for any feedback. I am by no means a great coder, I have taken only a few introductory courses back in college.

2

There are 2 answers

3
andrew On BEST ANSWER

I think you are looking for "path planning" rather than clustering. The traveling salesman problem comes to mind

If you want to use clustering to find the individual regions you should find the coordinates for each location with respect to some global frame. One example would be using Latitude and longitude coordinates. Create an array X thats 155x2 where each row is a destination with the columns lat,long Then simply run matlab's kmeans something like

[idx,C] = kmeans(X,8);

should work nicely. This should be enough to get you started.

One issue with this approach is that it will group the sites by geographical location. Which isn't always the same as shortest travel time. For instance,

distance from (site A, site B) = 0.5 miles
distance from (site A, site C) = 2.0 miles

but getting from A-B requires going around a river and actual travel distance is 10 miles, whereas A-C is realistically 2.5 miles, clearly A-C is the better choice, but using global position alone wouldn't take this into account

2
knb On

This looks more like an integer optimization problem. It has nothing to do with clustering.

Reminds me of the case study "Assigning Regions to Sales Representatives [SRs] at Pfizer Turkey" by Murat Köksalan and Sakine Batun, INFORMS Transactions on Education 9(2), p.70-71, January 2009. http://pubsonline.informs.org/doi/abs/10.1287/ited.1090.0021ca .

I had to solve a simplified version of the problem in a MOOC recently.

"Since the SRs have to visit the MDs in their offices, it is important to minimize the total distance traveled by the SRs. This is the objective function. Each SR has an office in a certain brick, called their "center brick". We will compute the total distance traveled by an SR as the sum of the distances between the center brick and every other brick in that SR's territory."

You can optimize this for certain criteria. I cannot give any more details here because it is quite complicated.