JaroWinkler Method --> Identifying Character/Numeric spots in a string

Question

244 views Asked by user2813606 At 30 November 2020 at 19:27

I am working on a problem to identify if a specified string has the correct format. I am attempting to use a fuzzy matching technique, JaroWinkler, to find the similarity score between a reference string and the strings of interest.

The correct format for the string follows this order (N=number, C=character): NNNCCCCCC

I found a similar problem on another StackOverflow question and edited the code a little here:

library(RecordLinkage)
library(dplyr)
library(stringdist)

ref <-c('123ABCDEF')
words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

df <- wordlist %>% 
        group_by(words) %>% 
        mutate(match_score = jarowinkler(words, ref))

df <- as.data.frame(df)
df

I know the JaroWinkler method is used for identifying common characters and considering string distance, but I'm not sure if this is the best method. Ideally, I'd like for the first and last elements in the words vector to be classified as correct and receive scores of 1 since they have the NNNCCCCCC format.

However, when I run this code, I get the following:

      words       ref match_score
1 456GHIJKL 123ABCDEF   0.0000000
2 123ABCDEF 123ABCDEF   1.0000000
3 78D78DAA2 123ABCDEF   0.3148148
4 660ABCDEF 123ABCDEF   0.7777778

Is there a better method for this type of matching exercise? Any help would be appreciated! Thank you!

There is 1 answer

**deschen** · Accepted Answer · 2020-11-30T21:23:44+00:00

As suggested in the comment above, I would use exact pattern matching rather than a similarity score. The only uncertainty for now is what you mean by "characters": only the letters A-Z, or also, e.g., punctuation? If only letters, see the code below.

library(tidyverse)

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{6})")

which gives:

[1]  TRUE  TRUE FALSE  TRUE
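One caveat worth illustrating: a pattern without anchors matches anywhere inside a string, so a longer string that merely contains three digits followed by six letters would also pass. Below is a minimal sketch of a stricter check in base R, assuming "characters" means the letters A-Z (either case) and that the whole string must be exactly nine characters long. The helper name `has_format` and the longer test string are made up for illustration.

```r
# Hypothetical helper: TRUE only when the entire string is
# exactly 3 digits followed by exactly 6 letters.
has_format <- function(x) {
  # ^ and $ anchor the pattern to the start and end of the string
  grepl("^[0-9]{3}[A-Za-z]{6}$", x)
}

words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")
has_format(words)
# [1]  TRUE  TRUE FALSE  TRUE

# Made-up longer string: it contains a valid block but is not itself valid
has_format("99123ABCDEFGH")
# [1] FALSE
```

The same anchored pattern can be passed to `str_detect()` if you prefer to stay within the tidyverse.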

Updating the answer to reflect the OP's changed pattern (assuming the fifth character may now be either a digit or a letter, with everything else unchanged):

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")

str_detect(words, "^[[:digit:]]{3}[[:alpha:]][[:alnum:]][[:alpha:]]{4}$")

gives:

[1]  TRUE  TRUE FALSE  TRUE  TRUE
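If you would rather keep the N/C notation from the question, another option is to translate each string into its "shape" and compare that to the reference format. This is a different technique from the regular expressions above, sketched here in base R under the assumption that only A-Z/a-z count as characters; the helper name `to_shape` is hypothetical.

```r
# Hypothetical helper: map every letter to "C" and every digit to "N".
# Any other symbol (punctuation, spaces) is left unchanged, so it will
# never match a format made only of N and C.
to_shape <- function(x) {
  # Letters first, so the "N" inserted in the next step is not re-mapped
  shape <- gsub("[A-Za-z]", "C", x)
  gsub("[0-9]", "N", shape)
}

words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

shapes <- to_shape(words)
shapes
# [1] "NNNCCCCCC" "NNNCCCCCC" "NNCNNCCCN" "NNNCCCCCC"

shapes == "NNNCCCCCC"
# [1]  TRUE  TRUE FALSE  TRUE
```

The shape strings also show where a non-matching value deviates from the expected format, which a plain TRUE/FALSE result does not.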
