I am using R and I have a dataframe containing strings of 4 unique letters (DNA). I am interested in counting the times certain unique combinations of letters occur in these strings. One of the possible scenarios is to detect how many times I see the same letter back to back.
I have come across several possible ways to achieve this using regex and packages like stringr but still have one problem.
These methods do not seem to step through the string letter by letter, so a match is never allowed to start inside the previous match. This is a problem when the same letter is repeated more than twice.
Example (where I want to count the times "CC" occurs and true_count column is my desired output):
sequence stringr_count true_count
ACCTACGT 1 1
CCCCCCCC 4 7
ACCCGCCT 2 3
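The non-overlapping behaviour can be reproduced with a small sketch. It uses base R's gregexpr as a stand-in for stringr::str_count (an assumption, made so the example needs no extra packages), and count_non_overlapping is a hypothetical helper name.

```r
sequences <- c("ACCTACGT", "CCCCCCCC", "ACCCGCCT")

# Hypothetical helper: count non-overlapping literal matches of `pattern`.
count_non_overlapping <- function(x, pattern) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  vapply(hits, function(m) sum(m > 0), integer(1))
}

non_overlapping <- count_non_overlapping(sequences, "CC")
print(non_overlapping)
```

This gives 1, 4 and 2, the same values as the stringr_count column, because each search resumes after the end of the previous match.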
I would recommend using stringi::stri_count_fixed with overlapping matches enabled (overlap = TRUE).
With a fixed pattern, stringi is an order of magnitude faster than using gregexpr in my microbenchmark runs.
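A minimal sketch of the overlapping count on the sequences from the question, assuming the stringi package is installed. The base-R variant uses a zero-width lookahead with gregexpr. count_overlapping_base is a hypothetical helper name, and it treats the pattern as a regular expression, which is fine for plain letters such as "CC".

```r
library(stringi)

sequences <- c("ACCTACGT", "CCCCCCCC", "ACCCGCCT")

# stringi: overlap = TRUE is passed on to stri_opts_fixed(), so a new match
# may start inside the previous one
stringi_counts <- stri_count_fixed(sequences, "CC", overlap = TRUE)

# Base R alternative (hypothetical helper): a lookahead consumes no characters,
# so every start position is tested
count_overlapping_base <- function(x, pattern) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  vapply(hits, function(m) sum(m > 0), integer(1))
}
base_counts <- count_overlapping_base(sequences, "CC")

print(data.frame(sequence = sequences, stringi = stringi_counts, base = base_counts))
```

Both approaches return 1, 7 and 3, which matches the true_count column. To compare timings on your own data, both calls can be wrapped in microbenchmark::microbenchmark().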
You may also take a look at the Biostrings library. In my experience it is usually slower than working with stringi and requires some additional steps, but it provides many useful functions designed to work with biological sequences, including countPattern.
And just to be sure:
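One way to run such a check, as a sketch: compare the stringi result against a naive sliding-window count written in base R. count_overlapping_naive is a hypothetical helper introduced only for this comparison, and the stringi package is assumed to be installed.

```r
library(stringi)

sequences <- c("ACCTACGT", "CCCCCCCC", "ACCCGCCT")

# Hypothetical reference implementation: test every window of width nchar(pattern)
count_overlapping_naive <- function(x, pattern) {
  k <- nchar(pattern)
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < k) return(0L)  # string shorter than the pattern: no match possible
    starts <- seq_len(n - k + 1L)
    sum(substring(s, starts, starts + k - 1L) == pattern)
  }, integer(1), USE.NAMES = FALSE)
}

naive_counts <- count_overlapping_naive(sequences, "CC")
fast_counts <- stri_count_fixed(sequences, "CC", overlap = TRUE)

# Stops with an error if the two methods disagree
stopifnot(all(naive_counts == fast_counts))
print(naive_counts)
```

The naive version is slow on long sequences, but it is easy to verify by eye, which makes it a reasonable reference for the faster methods.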