Count Unique Occurrences In Text String In R

1.2k views Asked by At

I am using R and I have a dataframe containing strings of 4 unique letters (DNA). I am interested in counting the times certain unique combinations of letters occur in these strings. One of the possible scenarios is to detect how many times I see the same letter back to back.

I have come across several possible ways to achieve this using regex and packages like stringr but still have one problem.

These methods do not seem to iterate through the substring (letter by letter) and consider the next letter in line to count as an observance. This is a problem where the same letter is repeated more than 2x.

Example (where I want to count the times "CC" occurs and true_count column is my desired output):

sequence  stringr_count  true_count
ACCTACGT      1             1
CCCCCCCC      4             7
ACCCGCCT      2             3
1

There are 1 answers

0
zero323 On

I would recommend using stringi::stri_count_fixed as follows:

> library(stringi)
> seqs <- data.frame(sequence=c('ACCTACGT', 'CCCCCCCC', 'ACCCGCCT'))
> opts <- stri_opts_fixed(overlap=TRUE)
> seqs$true_count <- stri_count_fixed(str=seqs$sequence, pattern='CC', opts_fixed=opts)
> seqs
  sequence true_count
1 ACCTACGT          1
2 CCCCCCCC          7
3 ACCCGCCT          3

With fixed pattern stringi is an order of magnitude faster than using gregexpr:

library(microbenchmark)

# Answer provided by @user20650 in the comments
f1 <- function(x) sapply(gregexpr('(?=CC)', x, perl=T) , function(i) sum(i>0))

f2 <- function(x) stri_count_fixed(
    str=x, pattern='CC',
    opts_fixed=stri_opts_fixed(overlap=TRUE))

# Generate random sequences
sequence <- stri_rand_strings(n=10000, length=1000, pattern='[ATGC]')

Microbenchmark results:

> microbenchmark(f1(sequence), f2(sequence))
Unit: milliseconds
         expr       min        lq      mean    median       uq       max neval
 f1(sequence) 290.90393 304.87107 329.11392 313.39819 327.9860 437.10229   100
 f2(sequence)  20.99733  21.12559  21.39206  21.26017  21.4377  27.68867   100

You may also take a look at the Biostrings library. From my experience it is usually slower than working with stringi and requires some additional steps but provides many useful functions designed to work with biological sequences, including countPattern:

library(Biostrings)

bsequence <- DNAStringSet(sequence)
f3 <- function(x) vcountPattern('CC', x)

Microbenchmark results:

> microbenchmark(f2(sequence), f3(bsequence))
Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval
  f2(sequence) 20.83336 21.11473 21.36759 21.25088 21.45000 23.80708   100
 f3(bsequence) 86.95430 89.10023 89.51665 89.37103 89.87699 91.88203   100

And just to be sure:

> identical(f1(seqs$sequence), f2(seqs$sequence),  f3(BStringSet(seqs$sequence))
[1] TRUE