numerical/user defined similarity calculation in reclin2

Question

numerical/user defined similarity calculation in reclin2

34 views Asked by Marius At 13 March 2024 at 17:11

I would like to use the reclin2 package or any other record matching package available in R in order to match reported transactions from each 2 market participants based on several numerical and categorical attributes (quantity, price, location, address…).

I wonder if there is a possiblity to use an R package and integrate a user-defined function for calculating the similarity per attribute between two records? For example, for comparing the values 90 and 100, is it possible to calculate the similarity as a percentage 90/100 = 0.9 instead of FALSE as they do not exactly match.

For example, all Report_IDs within the same Block_ID should be compared and a similarity score should be calculated.

reported transactions:

Data

structure(list(Report_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Block_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), Number = c(5,
3, 2, 10,     11, 100, 2, NA, 8, 2), Character = c("A", "A",
"B", "A", "B", "C", "D", "A", "D", "A")), class = "data.frame", 
row.names     = c(NA,
\-10L))

Report_ID	Block_ID	Number	Character
1	1	5	A
2	1	3	A
3	1	2	B
4	2	10	A
5	2	11	B
6	2	100	C
7	3	2	D
8	3	#NA	A
9	3	8	D
10	3	2	A

A result df could look like this: similarity dataframe

Report_ID_combination	Similarity_Number	Similarity_Character
1_2	0.60	1
1_3	0.40	0
2_3	0.67	0
4_5	0.91	0
4_6	0.10	0
5_6	0.11	0
7_8	#NA	0
7_9	0.25	1
7_10	1.00	0
8_9	#NA	0
8_10	#NA	1
9_10	0.25	0

As far as I know the typical to use packages for record matching/deduplication (reclin2, RecordLinkage, StatMatch, fedmatch) provide several distance measure for categorial features (Jaccard, Levenshtein, Cosine...), however for relative numerical comparison incl. NA handling I have not found a suitable package yet.

Any hint from someone working in the R/record matching universe on this kind of issue is appreciated.

Thanks,

Marius

Original Q&A

There are 2 answers

**Onyambu** · Answer 1 · 2024-03-13T17:27:00+00:00

Use the following:

#devtools::install_github('oonyambu/SLR')

similarity <- function(x, y){
  if (is.numeric(x)) list(num = min(x,y)/max(x,y))
  else list(num = +(x == y))
}

SLR::multiple_tests(.~Report_ID|Block_ID, df, similarity)|>
   tidyr::pivot_wider(names_from = response, values_from = num)

# A tibble: 12 × 4
   Block_ID grp   Number Character
   <chr>    <chr>  <dbl>     <dbl>
 1 1        1:2    0.6           1
 2 1        1:3    0.4           0
 3 1        2:3    0.667         0
 4 2        4:5    0.909         0
 5 2        4:6    0.1           0
 6 2        5:6    0.11          0
 7 3        7:8   NA             0
 8 3        7:9    0.25          1
 9 3        7:10   1             0
10 3        8:9   NA             0
11 3        8:10  NA             1
12 3        9:10   0.25          0

**Onyambu** · Answer 2 · 2024-03-13T22:59:29+00:00

Here is a base R solution:

fm <- function(x){
  fn <- function(x){
    vals <- if(is.numeric(x)) min(x)/max(x) else +(x[1] == x[2])
    setNames(vals, paste0(names(x), collapse = "_"))
  }
  r <- unlist(combn(x, 2, fn, FALSE))
  array2DF(structure(r, dim = length(r), dimnames = list(grp = names(r))))
}


combn_fun <- function(data, groups, within = NULL, vars = NULL){
  if(is.null(vars)) vars <- setdiff(names(data), c(groups, within))
  if(is.null(within)) within <- ave(groups, groups, FUN = seq_along)
  
  within <- interaction(data[within])
  groups <- data[groups]
  sapply(data[vars], \(y) tapply(setNames(y, within), groups,fm))|>
    array2DF()|>
    reshape(v.names = 'Value',timevar = 'Var2', direction = 'wide', 
            idvar = c('Var1', 'grp'))|>
    data.frame(row.names = NULL)
}

combn_fun(df, 'Block_ID', 'Report_ID')
   Var1  grp Value.Number Value.Character
1     1  1_2    0.6000000               1
2     1  1_3    0.4000000               0
3     1  2_3    0.6666667               0
4     2  4_5    0.9090909               0
5     2  4_6    0.1000000               0
6     2  5_6    0.1100000               0
7     3  7_8           NA               0
8     3  7_9    0.2500000               1
9     3 7_10    1.0000000               0
10    3  8_9           NA               0
11    3 8_10           NA               1
12    3 9_10    0.2500000               0

TechQA.

numerical/user defined similarity calculation in reclin2

Data

There are 2 answers

Related Questions in R

Related Questions in DUPLICATES

Related Questions in MATCHING

Related Questions in PATTERN-RECOGNITION

Related Questions in RECORD-LINKAGE

Popular Questions

Trending Questions