kmeans clustering on the basis of a fixed subset of the variables


I am a beginner in R and data analysis. I have a data set of around 2500 rows with 7 columns. I want to cluster the data set with 15 centers, but on the basis of just the first two columns (keeping the other columns intact in the clustered data set).

I also need to display the clustered data-set sorted on the basis of a third column.

Can someone help me with the required syntax? Let my CSV file be named locdata.csv, with the first two columns "lat" and "lon" and the third column "date".


There is 1 answer

Answer by MattV

This should help you get there.

First create the dataset (alternatively, import the CSV file), then run kmeans on only the two columns you want to cluster on:

set.seed(1)
# Simulated stand-in for the real data: 2500 rows, 7 columns
df <- data.frame(matrix(rnorm(n = 2500 * 7, mean = 10, sd = 20), ncol = 7))
names(df)[1:3] <- c("lat", "lon", "date")
# Use df <- read.csv("locdata.csv") instead to load from a file

library(dplyr)
cluster.df <- select(df, lat, lon) # Select the columns to cluster on
km <- kmeans(cluster.df, centers = 15) # Cluster on lat and lon only
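
If the data comes from a CSV as in the question, the same step can be sketched in base R without dplyr. This is only an illustration under assumptions: the temporary file stands in for locdata.csv (its contents are made up), and the values nstart = 25 and iter.max = 50 are assumed choices to make the result less dependent on the random starting centers, not something the question asks for.

```r
set.seed(1)

# Stand-in for locdata.csv (hypothetical contents): 2500 rows, 7 columns
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(matrix(rnorm(2500 * 7, mean = 10, sd = 20), ncol = 7))
names(demo)[1:3] <- c("lat", "lon", "date")
write.csv(demo, csv_path, row.names = FALSE)

# With the real file this would be read.csv("locdata.csv")
loc <- read.csv(csv_path)

# Cluster on lat and lon only; the other columns are not passed to kmeans.
# nstart = 25 reruns k-means from 25 random starts and keeps the best fit.
km_loc <- kmeans(loc[, c("lat", "lon")], centers = 15,
                 nstart = 25, iter.max = 50)

km_loc$size  # number of rows assigned to each of the 15 clusters
```

Because only a copy of the two selected columns goes into kmeans, the data frame loc itself keeps all seven columns unchanged.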

Next you can extract the clusters. km$cluster holds one cluster label per input row, in the same order as the rows passed to kmeans, so it can be attached directly to the original data frame:

# Extract the clusters and add them to original data frame
df$cluster <- km$cluster

# Sort on whatever column you prefer
df %>%
  arrange(date, cluster)
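
For readers who prefer not to load dplyr, the extraction and sorting steps can be sketched in base R as well. The small data frame below is made up for illustration (200 rows, 5 centers, and "date" stored as plain day numbers are all assumptions), so treat it as a sketch rather than a drop-in solution.

```r
set.seed(1)

# Made-up data for illustration: 200 rows, "date" as day numbers
toy <- data.frame(lat = rnorm(200), lon = rnorm(200),
                  date = sample(1:30, 200, replace = TRUE),
                  other = runif(200))
km_toy <- kmeans(toy[, c("lat", "lon")], centers = 5)

# km_toy$cluster has one label per input row, in row order
stopifnot(length(km_toy$cluster) == nrow(toy))
toy$cluster <- km_toy$cluster

# Sort by date, then by cluster within each date
sorted <- toy[order(toy$date, toy$cluster), ]
head(sorted)
```

If the date column is read from a CSV as text, it may need converting with as.Date() and a matching format string before sorting; otherwise the rows are ordered alphabetically rather than chronologically.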