R n most similar time series - dwt clustering / nearest neighbour


The data attached is a simplified example, as in reality I have hundreds of people and hundreds of points in time.

I am looking for a way to determine similar time series.

I have some code here to determine clusters, but this isn't exactly what I want.

What I would like is if I selected one person it would return the names of the n most similar time series.

For example, if n = 1 and I enter Bob, it would return Dave, whereas if I entered Sam it would return Bob (with these names going into a new column in df). If n = 2, the first column would contain the most similar time series and the second would contain the next most similar. This is similar to k-nearest neighbours but across time series, so that each individual person has a different set of "neighbours".

If this is infeasible or too difficult, I would alternatively like to specify the number of people in each group, rather than the number of groups.

In this example I specified 4 groups, but this does not produce 4 groups of 2.

Group B contains 4 people, whilst C and D have only 1 person.

        hc@cluster
James            A
Dave             B
Bob              B
Joe              C
Robert           A
Michael          B
Sam              B
Steve            D

library(dtwclust)

df <- data.frame(
  row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
  Monday    = c(82, 46, 96, 57, 69, 28, 100, 10),
  Tuesday   = c(77, 62, 112, 66, 54, 34, 107, 20),
  Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
  Thursday  = c(73, 92, 142, 77, 54, 30, 128, 40),
  Friday    = c(74, 49, 99, 90, 50, 25, 111, 50),
  Saturday  = c(68, 26, 76, 81, 42, 28, 63, 60),
  Sunday    = c(79, 37, 87, 73, 53, 33, 79, 70)
)

hc <- tsclust(df, type = "h", k = 4,
             preproc = zscore, seed = 899,
             distance = "sbd", centroid = shape_extraction,
             control = hierarchical_control(method = "average"))

plot(hc)

yo <- as.data.frame(hc@cluster)
yo$`hc@cluster` <- LETTERS[yo$`hc@cluster`]
print(yo)

There are 1 answers

Santiago I. Hurtado On BEST ANSWER

What you want is not to cluster the data but to rank the other series by their similarity to one specific time series; that is where the problem lies. To do this, you first have to choose a measure of "distance", for example Euclidean distance or correlation. The code below provides a function for each of these two measures. Each one simply calculates the distance between the selected time series and every other one, sorts the results, and picks the N closest (the smallest Euclidean distances, or the highest correlations). Note that the choice of distance measure will alter your results.
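As a quick illustration of why the choice of measure matters, here is a small sketch using three of the series from the question. The helper `euclid` is a hypothetical name introduced only for this example. Bob's values are exactly Dave's values plus 50, so the two have the same shape but different levels.

```r
# Three series copied from the question's data (Monday to Sunday)
bob  <- c(96, 112, 109, 142, 99, 76, 87)
dave <- c(46, 62, 59, 92, 49, 26, 37)
sam  <- c(100, 107, 114, 128, 111, 63, 79)

# Hypothetical helper: plain Euclidean distance between two vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Euclidean distance is sensitive to level: Sam is closer to Bob than Dave is
euclid(bob, dave)
euclid(bob, sam)

# Correlation only looks at shape: Dave matches Bob perfectly, Sam does not
cor(bob, dave)
cor(bob, sam)
```

So with Euclidean distance on the raw values Bob's nearest neighbour among these two is Sam, while with correlation it is Dave, which is the result the question expects.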

df <- data.frame(
  Monday    = c(82, 46, 96, 57, 69, 28, 100, 10),
  Tuesday   = c(77, 62, 112, 66, 54, 34, 107, 20),
  Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
  Thursday  = c(73, 92, 142, 77, 54, 30, 128, 40),
  Friday    = c(74, 49, 99, 90, 50, 25, 111, 50),
  Saturday  = c(68, 26, 76, 81, 42, 28, 63, 60),
  Sunday    = c(79, 37, 87, 73, 53, 33, 79, 70)
)

df <- as.data.frame(t(df))
colnames(df) <- c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve") 
  
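get_nearest_n <- function(data, name, n = 1) {
  # data: data frame with one column per person and one row per time point
  # name: a column name of data
  # n:    positive integer, smaller than the number of columns of data
  stopifnot(is.data.frame(data), name %in% colnames(data), n >= 1, n < ncol(data))

  serie <- data[[name]]
  others <- data[, colnames(data) != name, drop = FALSE]

  # Euclidean distance from the selected series to every other series
  dist <- vapply(others, function(x) sqrt(sum((x - serie)^2)), numeric(1))

  # Names of the n closest series, smallest distance first
  sorted_names <- names(sort(dist))[seq_len(n)]

  # Return the matching series; drop = FALSE keeps a data frame even when n = 1
  others[, sorted_names, drop = FALSE]
}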

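get_nearest_n2 <- function(data, name, n = 1) {
  # Same arguments as get_nearest_n, but ranks by Pearson correlation
  stopifnot(is.data.frame(data), name %in% colnames(data), n >= 1, n < ncol(data))

  serie <- data[[name]]
  others <- data[, colnames(data) != name, drop = FALSE]

  # Correlation of the selected series with every other series (named vector);
  # sorting a plain vector avoids calling sort() on a data frame
  sim <- vapply(others, function(x) cor(serie, x), numeric(1))

  # Names of the n most similar series, highest correlation first
  sorted_names <- names(sort(sim, decreasing = TRUE))[seq_len(n)]

  others[, sorted_names, drop = FALSE]
}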
    
get_nearest_n(data = df, name = 'Bob', n = 3)
get_nearest_n2(data = df, name = 'Bob', n = 3)
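The question also asks for the neighbours' names as new columns in df, for every person at once. A minimal sketch of one way to do that follows. It assumes each series is z-scored first (mirroring `preproc = zscore` in the question) and then compared with Euclidean distance via base R's `dist()`, rather than SBD. The function `nearest_names` and the column names `nearest_1`, `nearest_2`, ... are hypothetical, not part of any package.

```r
# Same layout as in the question: one row per person, one column per day
df <- data.frame(
  row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
  Monday    = c(82, 46, 96, 57, 69, 28, 100, 10),
  Tuesday   = c(77, 62, 112, 66, 54, 34, 107, 20),
  Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
  Thursday  = c(73, 92, 142, 77, 54, 30, 128, 40),
  Friday    = c(74, 49, 99, 90, 50, 25, 111, 50),
  Saturday  = c(68, 26, 76, 81, 42, 28, 63, 60),
  Sunday    = c(79, 37, 87, 73, 53, 33, 79, 70)
)

# Hypothetical helper: for every row, the names of its n nearest rows
nearest_names <- function(data, n = 1) {
  stopifnot(n >= 1, n < nrow(data))

  # z-score each person's series so that shape, not level, drives the distance
  z <- t(scale(t(as.matrix(data))))

  # Pairwise Euclidean distances between people
  d <- as.matrix(dist(z))
  diag(d) <- Inf  # a person is never their own neighbour

  # For each person, sort the others by distance and keep the first n names
  nn <- lapply(rownames(d), function(p) names(sort(d[p, ]))[seq_len(n)])
  out <- do.call(rbind, nn)
  rownames(out) <- rownames(d)
  colnames(out) <- paste0("nearest_", seq_len(n))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Append the two nearest neighbours of every person to df
result <- cbind(df, nearest_names(df, n = 2))
print(result[, c("nearest_1", "nearest_2")])
```

With this data Bob and Dave come out as each other's nearest neighbour, because their z-scored series are identical. Any other distance matrix with person names as dimnames could be substituted for `d` without changing the rest of the function.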