How to subtract large binary numbers?

69 views Asked by At

I want to compare what students answered in a multiple answer test and see which students deviate by only one answer. I've already converted answers (e.g. ABCDE) into hits or misses (eg. 00101). This test has 45 questions, so answer and hits strings are very long (e.g. 100000000000000010000010010001001000000000100). Base R can't even deal with these strings as numbers because of the floating-point error.

I have texts_1 with answers that got n questions right, and texts_2 with answers that got n+1 questions right. Then I compare every line in texts_1 with every line in texts_2 to look for strings with one character of difference.

One way to do this is with adist.

if (adist(texts_1[line_1], texts_2[line_2]) == 1) { ... }

If the result is 1, I know there's only one difference in the texts. This works, but the problem is that adist is very slow, and I have thousand of comparisons to do. It took 4 hours to make 20000 x 30000 comparisons.

My idea was to treat the hits and misses string as a number, and subtract them. If the answer was a power of 10, I'd know there was only one question different. e.g. 1101 - 1001 is a power of 10. However, R can't deal with numbers this big. Is there a package that lets me deal with binary numbers this large? Subtract and divide? Also some binary numbers will lead with zeroes.

tl;dr: How to subtract 001111111111111101111011111111111111111111111 - 001111111111111101111011111111111110111111111 in R? And then check if the answer is a power of 10?

3

There are 3 answers

0
Andre Wildberg On

If I understand the problem right this might help

If these are your tests

set.seed(42)
stud_tests1 <- replicate(20, paste(sample(0:1, 45, replace=T), collapse=""))
stud_tests2 <- replicate(20, paste(sample(0:1, 45, replace=T), collapse=""))

stud_tests2[6] <- "010010110110101001110110111010110111110110011"

Getting the tests that only differ by one

one <- strsplit(stud_tests1, "")
two <- strsplit(stud_tests2, "")

res <- sapply(one, \(x) sapply(two, \(y) sum(x != y) == 1))

Getting the test numbers

cbind(one = ceiling(which(res) / length(one)), two = which(res) %% length(two))
     one two
[1,]  18   6
1
jay.sf On

You could strsplit into a "numeric" matrix, use dist, and finally which() with arr.ind=TRUE. This is about 80 times faster.

Example on a vector of length 1e3:

> ## using `adist`
> system.time({
+   ad <- adist(a) |> as.matrix()
+   ares <- which(ad == 1, arr.ind=TRUE)
+ })
   user  system elapsed 
  8.642   0.000   8.633 
> 
> ## using `dist`
> system.time({
+   b <- strsplit(a, '') |> do.call(what='rbind') |> `mode<-`('numeric')
+   bd <- dist(b) |> as.matrix()
+   bres <- which(bd == 1, arr.ind=TRUE)
+ })
   user  system elapsed 
  0.098   0.011   0.109 
> 
> stopifnot(all.equal(ares, bres, check.attributes=FALSE))
> 
> head(bres)
    row col
342 342   4
475 475  29
587 587  62
373 373 131
650 650 156
710 710 174

Let's view a random result.

> set.seed(42)
> a[bres[sample.int(nrow(bres), 1L), ]] |> as.data.frame()
          a[bres[sample.int(nrow(bres), 1L), ]]
1 101111011111110111111111111111111111111111111
2 101111011111110111111111111111111111101111111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^

Data:

set.seed(42)
> a <- replicate(1e3, sample(0:1, 45, replace=TRUE, prob=c(6, 39)) |> 
+                  paste(collapse=''))
0
jblood94 On

A couple faster options for getting the pairwise differences.

library(Rfast) # for `Dist`

f1 <- function(a) {
  # from @jay.sf
  b <- strsplit(a, '') |> do.call(what='rbind') |> `mode<-`('numeric')
  dist(b, "manhattan") |> as.matrix()
}

f2 <- function(a) {
  Dist(
    matrix(utf8ToInt(paste0(a, collapse = "")), length(a), byrow = TRUE),
    "manhattan"
  )
}

f3 <- function(a) {
  n <- nchar(a[1])
  (n - crossprod(matrix(2*utf8ToInt(paste0(a, collapse = "")) - 97, n)))/2
}

The pairs that differ by 1 could be had by:

which(f3(a) == 1, TRUE)
#>      row col
#> [1,] 728 321
#> [2,] 841 321
#> [3,] 971 690
#> [4,] 321 728
#> [5,] 321 841
#> [6,] 989 946
#> [7,] 690 971
#> [8,] 946 989

Timing

microbenchmark::microbenchmark(
  f1 = unname(f1(a)),
  f2 = f2(a),
  f3 = f3(a),
  check = "identical"
)
#> Unit: milliseconds
#>  expr     min       lq     mean   median       uq      max neval
#>    f1 78.7927 83.75370 88.65015 85.40460 87.88880 132.4833   100
#>    f2 63.5259 64.96575 66.80393 65.76905 67.50755  83.3841   100
#>    f3 13.0596 13.32905 15.86264 13.67160 17.24970  58.8271   100

Data:

(seed <- sample(.Machine$integer.max, 1))
#> [1] 696435183
set.seed(seed)
a <- replicate(1e3, paste0(sample(0:1, 45, 1, c(0.18, 0.82)), collapse = ""))