identify sequences of approximately equivalent values in a series using R

Question

68 views · Asked by cathalcom on 18 February 2023 at 19:04

I have a series of values that contains runs of values close to one another, as in the example below. V1 holds the values and V2 holds group labels that I assigned by hand: wherever the label changes, the range of the values shifts. All values labelled 1 in V2 are within 20 points of each other, as are all values labelled 2, all values labelled 3, and so on. The values within a group are not identical; they cluster around a common value.

I identified these clusters manually. How could I automate it?

        V1 V2
1  399.710  1
2  403.075  1
3  405.766  1
4  407.112  1
5  408.458  1
6  409.131  1
7  410.477  1
8  411.150  1
9  412.495  1
10 332.419  2
11 330.400  2
12 329.054  2
13 327.708  2
14 326.363  2
15 325.017  2
16 322.998  2
17 319.633  2
18 314.923  2
19 288.680  3
20 285.315  3
21 283.969  3
22 281.950  3
23 279.932  3
24 276.567  3
25 273.875  3
26 272.530  3
27 271.857  3
28 272.530  3
29 273.875  3
30 274.548  3
31 275.894  3
32 275.894  3
33 276.567  3
34 277.240  3
35 278.586  3
36 279.932  3
37 281.950  3
38 284.642  3
39 288.007  3
40 291.371  3
41 294.063  4
42 295.409  4
43 296.754  4
44 297.427  4
45 298.100  4
46 299.446  4
47 300.792  4
48 303.484  4
49 306.848  4
50 327.708  5
51 309.540  6
52 310.213  6
53 309.540  6
54 306.848  6
55 304.156  6
56 302.811  6
57 302.811  6
58 304.156  6
59 305.502  6
60 306.175  6
61 306.175  6
62 304.829  6

I haven't tried anything yet because I don't know how to approach this.

Original Q&A

There are 2 answers

**jay.sf** · Answer 1 · 2023-02-18T19:17:48+00:00

You can use dist and hclust with cutree to detect clusters, then start a new sequence number at every break so that each consecutive run gets its own level.

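# Hierarchical clustering on pairwise distances (hclust defaults to complete linkage)
hc <- hclust(dist(x))
# Cut the tree into k = 6 clusters
cl <- cutree(hc, k=6)
# Start a new sequence number wherever the cluster label changes
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)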
#          x seq
# 1  399.710   1
# 2  403.075   1
# 3  405.766   1
# 4  407.112   1
# 5  408.458   1
# 6  409.131   1
# 7  410.477   1
# 8  411.150   1
# 9  412.495   1
# 10 332.419   2
# 11 330.400   2
# 12 329.054   2
# 13 327.708   2
# 14 326.363   2
# 15 325.017   2
# 16 322.998   2
# 17 319.633   3
# 18 314.923   3
# 19 288.680   4
# 20 285.315   4
# 21 283.969   4
# 22 281.950   4
# 23 279.932   4
# 24 276.567   5
# 25 273.875   5
# 26 272.530   5
# 27 271.857   5
# 28 272.530   5
# 29 273.875   5
# 30 274.548   5
# 31 275.894   5
# 32 275.894   5
# 33 276.567   5
# 34 277.240   5
# 35 278.586   6
# 36 279.932   6
# 37 281.950   6
# 38 284.642   6
# 39 288.007   6
# 40 291.371   6
# 41 294.063   7
# 42 295.409   7
# 43 296.754   7
# 44 297.427   7
# 45 298.100   7
# 46 299.446   7
# 47 300.792   7
# 48 303.484   7
# 49 306.848   7
# 50 327.708   8
# 51 309.540   9
# 52 310.213   9
# 53 309.540   9
# 54 306.848   9
# 55 304.156   9
# 56 302.811   9
# 57 302.811   9
# 58 304.156   9
# 59 305.502   9
# 60 306.175   9
# 61 306.175   9
# 62 304.829   9
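The seq column comes from the cumsum(c(0, diff(cl)) != 0) + 1 idiom, which is why k=6 clusters can still yield more than 6 sequences: a cluster label that reappears later in the series gets a fresh number. A minimal sketch on a made-up label vector (the values of cl here are purely illustrative, not taken from the data):

```r
# Toy cluster labels; cluster 1 appears in two separate runs
cl <- c(1, 1, 2, 2, 1, 1, 3)

# TRUE wherever the label differs from the previous element
changed <- c(0, diff(cl)) != 0

# Running count of changes, shifted so the first run is numbered 1
run_id <- cumsum(changed) + 1
run_id
# 1 1 2 2 3 3 4

# Equivalent formulation using run-length encoding
r <- rle(cl)
run_id2 <- rep(seq_along(r$lengths), r$lengths)
```

Both versions number consecutive runs, so the second run of cluster 1 becomes sequence 3 rather than being merged with the first.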

However, the dendrogram suggests k=4 clusters rather than 6, although where to cut the tree is a somewhat arbitrary choice.

plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
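Instead of choosing k, the tree can also be cut at a height with cutree(hc, h = ...). With the default complete linkage, the merge height is the largest pairwise distance inside the merged cluster, so cutting at h keeps every cluster's range at or below h. The sketch below assumes h = 20, taken from the "within 20 points" wording in the question; it is one possible reading, not necessarily the cut the asker would pick.

```r
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

hc <- hclust(dist(x))          # default method is "complete"

# Assumed threshold: cut so that no cluster spans more than 20 points
cl_h <- cutree(hc, h = 20)

# Renumber so every consecutive run gets its own sequence number
run_id <- cumsum(c(0, diff(cl_h)) != 0) + 1

# Range of the values inside each cluster, to check the 20-point bound
cluster_range <- tapply(x, cl_h, function(v) diff(range(v)))

data.frame(x, cluster = cl_h, seq = run_id)
```

Because the clustering ignores the order of the series, a run that drifts away and comes back may still be split into several sequences, so the result is worth checking against the manual labels.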

Data:

x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477, 
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017, 
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95, 
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875, 
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932, 
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754, 
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708, 
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811, 
304.156, 305.502, 306.175, 306.175, 304.829)

**zephryl** · Answer 2 · 2023-02-18T21:20:11+00:00

This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.

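# dat is the data frame from the question (columns V1 and V2)
maxrange <- 18   # largest range allowed within one group

grp_start <- 1   # index where the current group begins
grp_num <- 1     # label of the current group
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  # all values in the current group, including the candidate value i
  grp <- dat$V1[grp_start:i]
  if (max(grp) - min(grp) > maxrange) {
    # adding value i would exceed the threshold, so it starts a new group
    grp_num <- grp_num + 1 
    grp_start <- i
  }
  V3[[i]] <- grp_num
}

cbind(dat, V3)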

        V1 V2 V3
1  399.710  1  1
2  403.075  1  1
3  405.766  1  1
4  407.112  1  1
5  408.458  1  1
6  409.131  1  1
7  410.477  1  1
8  411.150  1  1
9  412.495  1  1
10 332.419  2  2
11 330.400  2  2
12 329.054  2  2
13 327.708  2  2
14 326.363  2  2
15 325.017  2  2
16 322.998  2  2
17 319.633  2  2
18 314.923  2  2
19 288.680  3  3
20 285.315  3  3
21 283.969  3  3
22 281.950  3  3
23 279.932  3  3
24 276.567  3  3
25 273.875  3  3
26 272.530  3  3
27 271.857  3  3
28 272.530  3  3
29 273.875  3  3
30 274.548  3  3
31 275.894  3  3
32 275.894  3  3
33 276.567  3  3
34 277.240  3  3
35 278.586  3  3
36 279.932  3  3
37 281.950  3  3
38 284.642  3  3
39 288.007  3  3
40 291.371  3  4
41 294.063  4  4
42 295.409  4  4
43 296.754  4  4
44 297.427  4  4
45 298.100  4  4
46 299.446  4  4
47 300.792  4  4
48 303.484  4  4
49 306.848  4  4
50 327.708  5  5
51 309.540  6  6
52 310.213  6  6
53 309.540  6  6
54 306.848  6  6
55 304.156  6  6
56 302.811  6  6
57 302.811  6  6
58 304.156  6  6
59 305.502  6  6
60 306.175  6  6
61 306.175  6  6
62 304.829  6  6

A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
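To compare thresholds more easily, the same loop can be wrapped in a function. This is only a sketch: group_by_range is a hypothetical helper name, and it tracks the running minimum and maximum of the current group instead of re-slicing the vector on every iteration, which should give the same grouping.

```r
# Hypothetical helper (not part of the original answer): start a new group
# whenever adding the next value would push the group's range past maxrange.
group_by_range <- function(v, maxrange) {
  grp <- integer(length(v))
  if (length(v) == 0) return(grp)
  grp_num <- 1L
  lo <- hi <- v[1]                 # running min and max of the current group
  for (i in seq_along(v)) {
    lo_new <- min(lo, v[i])
    hi_new <- max(hi, v[i])
    if (hi_new - lo_new > maxrange) {
      grp_num <- grp_num + 1L      # value i opens a new group
      lo <- hi <- v[i]
    } else {
      lo <- lo_new
      hi <- hi_new
    }
    grp[i] <- grp_num
  }
  grp
}

x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

g18 <- group_by_range(x, 18)
g20 <- group_by_range(x, 20)

# Row at which each group starts, for both thresholds
starts18 <- which(!duplicated(g18))
starts20 <- which(!duplicated(g20))
```

With a threshold of 18 the groups start at rows 1, 10, 19, 40, 50 and 51; with 20 they start at rows 1, 10, 19, 41, 50 and 54. So 20 fixes the start of group 4 but absorbs rows 51 to 53 into the single-value group 5, which is the trade-off described above.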
