R - how to find longest duplicate sequences and their frequencies

Question

Asked by Nena on 05 January 2025 at 08:11 (463 views)

I have some data that looks like this:

29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43

There is actually a lot more data, but I wanted to keep this short. I'd like to find the longest common subsequences across all rows and sort them by decreasing frequency, reporting only those common subsequences that contain more than one element and occur more than once. Is there a way to do this in R?

An example result would look something like this:

[29] 3
[30] 2 
...
( etc for all the single duplicates across each row and their frequencies )
...
[46  47] 2
[39  40  43] 3
[40, 43] 2

Original Q&A

There is 1 answer

**CPak** · Answer 1 · 2017-09-14T22:23:26+00:00

It seems you are asking for two different things: 1) the length of contiguous runs of a single value within each column, and 2) counts of n-grams that are built row-wise but counted column-wise, where the matching rows do not have to be adjacent.

library(tidyverse)
# single-number contiguous runs, column by column
single <- Reduce("rbind", apply(df, 2, function(x) {
    r <- rle(x)  # run-length encode the column once instead of twice
    tibble(val=r$values, occurrence=r$lengths) %>% filter(occurrence>1)
}))

Output of single

    val occurrence
  <int>      <int>
1    29          3
2    30          2
3    40          2
4    43          2
5    43          2
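
To see what rle() contributes in the snippet above, here is a minimal base-R sketch on the first column of the example data alone. The names col1 and runs are only illustrative and do not appear in the answer's code.

```r
# First column of the example data from the question
col1 <- c(29, 29, 29, 30, 30, 39, 7, 1)

# rle() collapses consecutive equal values into (value, run length) pairs
r <- rle(col1)

# Keep only the runs that are longer than one element
runs <- data.frame(val = r$values, occurrence = r$lengths)
runs <- runs[runs$occurrence > 1, ]
print(runs)
```

Only 29 (run of 3) and 30 (run of 2) survive the filter, which matches the first two rows of the output of single.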

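# n-grams built row-wise, counted column-wise (matching rows need not be adjacent)
restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
    # z + 1 is the n-gram length: paste every window of z + 1 adjacent values within a row
    nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
    # tabulate each window position separately and keep the n-grams seen more than once
    Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))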

Output of ngrams

       val occurrence
     <chr>      <int>
1    39 40          2
2    46 47          2
3    40 43          3
4 39 40 43          2
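
The counts above are tied to the column position where an n-gram starts, so "39 40 43" in columns 1 to 3 of the sixth row is not matched with the same values in columns 4 to 6 of other rows. If position should not matter, which is what the question's expected [39 40 43] 3 suggests, one possible variant is sketched below in base R. This is an assumption about the intent, not the answer's method; count_ngrams and anypos are hypothetical names introduced only for this illustration.

```r
# Example data from the question
df <- read.table(text = "29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43")
m <- as.matrix(df)

# Hypothetical helper: count n-grams of length n (adjacent values within
# a row), pooling all start positions instead of counting per column
count_ngrams <- function(m, n) {
    starts <- seq_len(ncol(m) - n + 1)
    grams <- unlist(lapply(starts, function(s) {
        # paste the n adjacent values starting at column s, for every row
        apply(m[, s:(s + n - 1), drop = FALSE], 1, paste, collapse = " ")
    }))
    tab <- table(grams)
    keep <- as.vector(tab > 1)  # only n-grams that occur more than once
    data.frame(val = names(tab)[keep],
               occurrence = as.integer(tab)[keep],
               stringsAsFactors = FALSE)
}

# Pool the results for every n-gram length from 2 up to the row width
anypos <- do.call(rbind, lapply(2:ncol(m), function(n) count_ngrams(m, n)))
print(anypos)
```

With the example data this gives 3 for "39 40 43". It counts every occurrence, including those nested inside a longer repeated n-gram, so "40 43" comes out as 5 rather than the 2 shown in the question; discounting nested occurrences would need an extra step that is not shown here.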

Combining the data

ans <- rbind(single, restof)

Output

       val occurrence
     <chr>      <int>
1       29          3
2       30          2
3       40          2
4       43          2
5       43          2
6    39 40          2
7    46 47          2
8    40 43          3
9 39 40 43          2
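
The question also asks for the result sorted by decreasing frequency, which the combined table is not. A minimal base-R sketch of that last step follows; the data frame is re-typed from the output above so the snippet runs on its own, and ans_sorted is an illustrative name.

```r
# Combined result, copied from the printed output above
ans <- data.frame(
    val = c("29", "30", "40", "43", "43", "39 40", "46 47", "40 43", "39 40 43"),
    occurrence = c(3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L),
    stringsAsFactors = FALSE
)

# order() with decreasing = TRUE puts the most frequent entries first
ans_sorted <- ans[order(ans$occurrence, decreasing = TRUE), ]
print(ans_sorted)
```

Since the answer already loads the tidyverse, arrange(ans, desc(occurrence)) from dplyr should be an equivalent way to do the same on the tibble.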

Your data (run this first, since the code above uses df)

df <- read.table(text="29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43")
