numerical/user defined similarity calculation in reclin2

34 views Asked by At

I would like to use the reclin2 package or any other record matching package available in R in order to match reported transactions from each 2 market participants based on several numerical and categorical attributes (quantity, price, location, address…).

I wonder if there is a possiblity to use an R package and integrate a user-defined function for calculating the similarity per attribute between two records? For example, for comparing the values 90 and 100, is it possible to calculate the similarity as a percentage 90/100 = 0.9 instead of FALSE as they do not exactly match.

For example, all Report_IDs within the same Block_ID should be compared and a similarity score should be calculated.

reported transactions:

Data
structure(list(Report_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Block_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), Number = c(5,
3, 2, 10,     11, 100, 2, NA, 8, 2), Character = c("A", "A",
"B", "A", "B", "C", "D", "A", "D", "A")), class = "data.frame", 
row.names     = c(NA,
\-10L))
Report_ID Block_ID Number Character
1 1 5 A
2 1 3 A
3 1 2 B
4 2 10 A
5 2 11 B
6 2 100 C
7 3 2 D
8 3 #NA A
9 3 8 D
10 3 2 A

A result df could look like this: similarity dataframe

Report_ID_combination Similarity_Number Similarity_Character
1_2 0.60 1
1_3 0.40 0
2_3 0.67 0
4_5 0.91 0
4_6 0.10 0
5_6 0.11 0
7_8 #NA 0
7_9 0.25 1
7_10 1.00 0
8_9 #NA 0
8_10 #NA 1
9_10 0.25 0

As far as I know the typical to use packages for record matching/deduplication (reclin2, RecordLinkage, StatMatch, fedmatch) provide several distance measure for categorial features (Jaccard, Levenshtein, Cosine...), however for relative numerical comparison incl. NA handling I have not found a suitable package yet.

Any hint from someone working in the R/record matching universe on this kind of issue is appreciated.

Thanks,

Marius

2

There are 2 answers

2
Onyambu On

Use the following:

#devtools::install_github('oonyambu/SLR')

similarity <- function(x, y){
  if (is.numeric(x)) list(num = min(x,y)/max(x,y))
  else list(num = +(x == y))
}

SLR::multiple_tests(.~Report_ID|Block_ID, df, similarity)|>
   tidyr::pivot_wider(names_from = response, values_from = num)

# A tibble: 12 Ă— 4
   Block_ID grp   Number Character
   <chr>    <chr>  <dbl>     <dbl>
 1 1        1:2    0.6           1
 2 1        1:3    0.4           0
 3 1        2:3    0.667         0
 4 2        4:5    0.909         0
 5 2        4:6    0.1           0
 6 2        5:6    0.11          0
 7 3        7:8   NA             0
 8 3        7:9    0.25          1
 9 3        7:10   1             0
10 3        8:9   NA             0
11 3        8:10  NA             1
12 3        9:10   0.25          0
0
Onyambu On

Here is a base R solution:

fm <- function(x){
  fn <- function(x){
    vals <- if(is.numeric(x)) min(x)/max(x) else +(x[1] == x[2])
    setNames(vals, paste0(names(x), collapse = "_"))
  }
  r <- unlist(combn(x, 2, fn, FALSE))
  array2DF(structure(r, dim = length(r), dimnames = list(grp = names(r))))
}


combn_fun <- function(data, groups, within = NULL, vars = NULL){
  if(is.null(vars)) vars <- setdiff(names(data), c(groups, within))
  if(is.null(within)) within <- ave(groups, groups, FUN = seq_along)
  
  within <- interaction(data[within])
  groups <- data[groups]
  sapply(data[vars], \(y) tapply(setNames(y, within), groups,fm))|>
    array2DF()|>
    reshape(v.names = 'Value',timevar = 'Var2', direction = 'wide', 
            idvar = c('Var1', 'grp'))|>
    data.frame(row.names = NULL)
}

combn_fun(df, 'Block_ID', 'Report_ID')
   Var1  grp Value.Number Value.Character
1     1  1_2    0.6000000               1
2     1  1_3    0.4000000               0
3     1  2_3    0.6666667               0
4     2  4_5    0.9090909               0
5     2  4_6    0.1000000               0
6     2  5_6    0.1100000               0
7     3  7_8           NA               0
8     3  7_9    0.2500000               1
9     3 7_10    1.0000000               0
10    3  8_9           NA               0
11    3 8_10           NA               1
12    3 9_10    0.2500000               0