JaroWinkler Method --> Identifying Character/Numeric spots in a string

I am working on a problem to identify if a specified string has the correct format. I am attempting to use a fuzzy matching technique, JaroWinkler, to find the similarity score between a reference string and the strings of interest.

The correct format for the string follows this order (N=number, C=character): NNNCCCCCC

I found a similar problem on another StackOverflow question and edited the code a little here:

library(RecordLinkage)
library(dplyr)
library(stringdist)

ref <- c("123ABCDEF")
words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

df <- wordlist %>% 
        group_by(words) %>% 
        mutate(match_score = jarowinkler(words, ref))

df <- as.data.frame(df)
df

I know the JaroWinkler method scores strings by the characters they share and where those characters sit, but I'm not sure it is the best method here. Ideally, every element of the words vector that has the NNNCCCCCC format (the first, second, and last) would be classified as correct and receive a score of 1.

However, when I run this code, I get the following:

      words       ref match_score
1 456GHIJKL 123ABCDEF   0.0000000
2 123ABCDEF 123ABCDEF   1.0000000
3 78D78DAA2 123ABCDEF   0.3148148
4 660ABCDEF 123ABCDEF   0.7777778

Is there a better method for this type of matching exercise? Any help would be appreciated! Thank you!

Answer by deschen (best answer)

As suggested in the comment above, I would do exact pattern matching instead. The only open question for now is what you mean by "characters": only the letters A-Z, or also e.g. punctuation? If only letters, see the code below.

library(tidyverse)

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

# ^ and $ anchor the pattern, so the whole string must be 3 digits followed by 6 letters
str_detect(words, "^[[:digit:]]{3}[[:alpha:]]{6}$")

which gives:

[1]  TRUE  TRUE FALSE  TRUE
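
For comparison, here is a minimal base-R sketch of the same check, assuming "characters" means ASCII letters only. The two extra strings in `too_long` are made up for illustration. They show why the anchors `^` and `$` matter: without them, a longer string that merely contains the pattern would also pass.

```r
words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

# Anchored pattern: the whole string must be 3 digits followed by 6 letters
pattern <- "^[0-9]{3}[A-Za-z]{6}$"
is_valid <- grepl(pattern, words, perl = TRUE)
is_valid

# Hypothetical strings that contain the pattern but are one character too long
too_long <- c("X123ABCDEF", "123ABCDEFG")

# Unanchored: a matching substring is enough, so both are accepted
grepl("[0-9]{3}[A-Za-z]{6}", too_long, perl = TRUE)

# Anchored: the whole string must match, so both are rejected
grepl(pattern, too_long, perl = TRUE)
```

`is_valid` is TRUE for every element except "78D78DAA2".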

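Updating the answer to reflect the OP's changed pattern, which I read as: three digits, one letter, one digit or letter, then four letters.

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")

# Consecutive groups rather than stacked lookaheads: lookaheads all test the same position
str_detect(words, "^[[:digit:]]{3}[[:alpha:]][[:alnum:]][[:alpha:]]{4}$")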

gives:

[1]  TRUE  TRUE FALSE  TRUE  TRUE
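
If a 0/1 score per string is wanted, as in the data frame from the question, one option is to translate each string into the question's own N/C notation and compare that mask with the target format. This is only a sketch under assumptions: it uses the original NNNCCCCCC format, treats only ASCII letters as "characters", and `to_mask` is a hypothetical helper name, not a function from any package.

```r
words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

# Hypothetical helper: map every letter to "C" and every digit to "N".
# Letters are replaced first, because "N" is itself a letter and would
# otherwise be turned into "C" by the second substitution.
to_mask <- function(x) {
  x <- gsub("[A-Za-z]", "C", x, perl = TRUE)
  gsub("[0-9]", "N", x, perl = TRUE)
}

result <- data.frame(
  words = words,
  mask = to_mask(words),
  stringsAsFactors = FALSE
)

# 1 if the mask equals the target format exactly, 0 otherwise
result$match_score <- as.numeric(result$mask == "NNNCCCCCC")
result
```

The `mask` column also shows where a rejected string deviates: "78D78DAA2" becomes "NNCNNCCCN". Any other symbol, such as punctuation, is left unchanged in the mask and therefore never matches.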