I have been working around this problem for a while without finding a satisfactory solution.
I have data in a binary sparse matrix (TermDocumentMatrix) with dimensions 340436 x 763717. Here I use an extract as a proof of concept:
m = structure(list(i = c(1L, 2L, 5L, 2L, 4L, 3L, 5L, 4L), j = c(1L, 1L, 1L, 2L, 2L,
3L, 3L, 3L), v = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), nrow = 5L, ncol = 3L,
dimnames = list(Terms = c("action", "activities", "advisory", "alike", "almanac"),
Docs = c("1000008721", "1000010083","1000013295"))),
class = c("TermDocumentMatrix", "simple_triplet_matrix"), weighting = c("binary", "bin"))
inspect(m)
<<TermDocumentMatrix (terms: 5, documents: 3)>>
Non-/sparse entries: 8/7
Sparsity : 47%
Maximal term length: 10
Weighting : binary (bin)
Sample :
Docs
Terms 1000008721 1000010083 1000013295
action 1 0 0
activities 1 1 0
advisory 0 0 1
alike 0 1 1
almanac 1 0 1
I want to normalize every document vector to unit length, and then obtain a (sparse) matrix with the Docs on both rows and columns and the dot products of the corresponding normalized vectors as values.
Expected output:
Sparse Matrix:
Docs
Docs 1000008721 1000010083 1000013295 ... N
1000008721 1.0000000 0.4082483 0.3333333 .
1000010083 0.4082483 1.0000000 0.4082483 .
1000013295 0.3333333 0.4082483 1.0000000 .
...
N . . .
or also: data.table
ID1 ID2 v
1000008721 1000008721 1
1000010083 1000008721 0.4082483
1000013295 1000008721 0.3333333
... ... ...
This would be easy to achieve with crossprod_simple_triplet_matrix(m) after normalizing with a function that divides every value by the norm of its column. For a binary vector, the Euclidean norm reduces to sqrt(col_sums(m)).
Since I cannot go through an as.matrix() transformation ("Error: cannot allocate vector of size 968.6 Gb"), and I couldn't find any other way, I used data.table, which may circumvent the need to apply a function over a sparse matrix:
library(data.table)
# exploit the triplets and manipulate them through data.table
dt = data.table(i = m$i, j = m$j, v = m$v)
# Euclidean norm of every column (v is binary, so sum(v) equals sum(v^2))
dt[, e.norm := sqrt(sum(v)), by = j]
# divide each value by the norm of its column and write the result back
m$v <- dt[, v / e.norm]
inspect(m)
Docs
Terms 1000008721 1000010083 1000013295
action 0.5773503 0.0000000 0.0000000
activities 0.5773503 0.7071068 0.0000000
advisory 0.0000000 0.0000000 0.5773503
alike 0.0000000 0.7071068 0.5773503
almanac 0.5773503 0.0000000 0.5773503
(What would the equivalent of this (maybe with slam) be?)
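One possible slam-only equivalent is sketched below, on the assumption that the object can be treated as a plain simple_triplet_matrix (which a TermDocumentMatrix inherits from). It scales the stored values directly with col_sums() and then calls crossprod_simple_triplet_matrix(), which is harmless on the extract but still returns a dense matrix, so it does not solve the full-size problem.

```r
library(slam)

# The 5 x 3 extract as a plain simple_triplet_matrix (terms x documents)
m <- simple_triplet_matrix(
  i = c(1L, 2L, 5L, 2L, 4L, 3L, 5L, 4L),
  j = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  v = rep(1, 8),
  nrow = 5L, ncol = 3L,
  dimnames = list(
    Terms = c("action", "activities", "advisory", "alike", "almanac"),
    Docs  = c("1000008721", "1000010083", "1000013295")
  )
)

# Euclidean norm of each document; with 0/1 entries sum(v^2) equals sum(v)
norms <- sqrt(unname(col_sums(m)))

# Divide every stored value by the norm of the column it belongs to
m$v <- m$v / norms[m$j]

# Fine for the extract, but the result is a dense Docs x Docs matrix
sim <- crossprod_simple_triplet_matrix(m)
round(sim, 7)
```

On the extract this reproduces the 3 x 3 block of the expected output shown earlier.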
QUESTION: Given that crossprod_simple_triplet_matrix(tdm) still returns a dense matrix (hence the memory error), can you think of a similar data.table solution that returns a sparse matrix with the cross product of two sparse matrices, or of any alternative way?
A 340436 x 763717 sparse matrix with 35879680 non-zero elements will result in a very large object (~30 GB). My machine isn't able to hold that object in memory with 16GB RAM. However, the cross product is straightforward to do piecemeal. The function bigcrossprod below performs the cross product piecemeal, converts the results to data.table objects, and then rbinds the objects. The crossprod operation is broken into nseg separate operations.
Demonstrating with a somewhat smaller sparse matrix:
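A minimal sketch of the piecewise idea, shown on the 5 x 3 extract from the question so the numbers can be checked against the expected output. It assumes the Matrix and data.table packages; the helper name piecewise_crossprod and its arguments are hypothetical, and it is not necessarily how bigcrossprod itself is written.

```r
library(Matrix)
library(data.table)

# Hypothetical helper (not the original bigcrossprod): computes t(x) %*% x one
# block of columns at a time and returns the non-zero entries in long format.
piecewise_crossprod <- function(x, nseg = 2L) {
  n <- ncol(x)
  ids <- if (is.null(colnames(x))) seq_len(n) else colnames(x)
  # Split the column indices into (at most) nseg contiguous blocks
  blocks <- split(seq_len(n), ceiling(seq_len(n) * nseg / n))
  pieces <- lapply(blocks, function(cols) {
    # n x length(cols) slice of the full cross product; the result stays sparse
    s <- summary(crossprod(x, x[, cols, drop = FALSE]))
    data.table(ID1 = ids[s$i], ID2 = ids[cols[s$j]], v = s$x)
  })
  rbindlist(pieces)
}

# The 5 x 3 extract from the question as a column-compressed sparse matrix
x <- sparseMatrix(
  i = c(1, 2, 5, 2, 4, 3, 5, 4),
  j = c(1, 1, 1, 2, 2, 3, 3, 3),
  x = 1, dims = c(5, 3),
  dimnames = list(NULL, c("1000008721", "1000010083", "1000013295"))
)

# Scale every column to unit length by dividing the stored values in place
x@x <- x@x / rep.int(sqrt(colSums(x^2)), diff(x@p))

res <- piecewise_crossprod(x, nseg = 2L)
res
```

Only one block of the cross product is held as a sparse matrix at any time; the long-format pieces are what accumulate, so nseg trades peak memory against the number of crossprod calls.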
In order to calculate the cross product of the 340436 x 763717 sparse matrix with 35879680 non-zero elements, instead of collecting the data.table objects in a list to pass to rbindlist, save the individual data.table objects for later processing using the fst package. Instead of returning a single data.table, the following version of bigcrossprod returns a character vector of length nseg containing .fst file paths. Again, demonstrating with the smaller matrix:
I was able to process a 340436 x 763717 sparse matrix with 35879680 non-zero elements in about 15 minutes with 16GB of RAM.
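The file-backed variant described above could look roughly like the following sketch. It assumes the Matrix, data.table and fst packages; the function name piecewise_crossprod_fst, the dir argument and the file naming scheme are hypothetical choices for illustration, not the original code.

```r
library(Matrix)
library(data.table)
library(fst)

# Hypothetical file-backed variant (not the original code): each block of the
# cross product is written to its own .fst file and only the paths are returned.
piecewise_crossprod_fst <- function(x, nseg = 2L, dir = tempdir()) {
  n <- ncol(x)
  blocks <- split(seq_len(n), ceiling(seq_len(n) * nseg / n))
  vapply(seq_along(blocks), function(k) {
    cols <- blocks[[k]]
    s <- summary(crossprod(x, x[, cols, drop = FALSE]))
    path <- file.path(dir, sprintf("crossprod_%03d.fst", k))
    # Store integer column indices rather than document names
    write_fst(data.table(ID1 = s$i, ID2 = cols[s$j], v = s$x), path)
    path
  }, character(1))
}

# Small binary example: 5 terms x 3 documents (not normalized here)
x <- sparseMatrix(
  i = c(1, 2, 5, 2, 4, 3, 5, 4),
  j = c(1, 1, 1, 2, 2, 3, 3, 3),
  x = 1, dims = c(5, 3)
)
paths <- piecewise_crossprod_fst(x, nseg = 3L)

# Any single block can be read back later without loading the others
block1 <- read_fst(paths[1], as.data.table = TRUE)
```

Because nothing but the file paths is kept in memory between blocks, peak memory is bounded by the largest single block rather than by the whole result.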
Explanation:
A walkthrough of the logic in bigcrossprod using the OP's 5 x 3 example matrix:
And, finally, a parallel version of bigcrossprod (for Linux):
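A parallel variant could be sketched with parallel::mclapply, which forks the R process and therefore only runs in parallel on Unix-like systems. The function name piecewise_crossprod_par and the cores argument are hypothetical; this is an illustration of the approach, not the original parallel code.

```r
library(Matrix)
library(data.table)
library(parallel)

# Hypothetical parallel variant (not the original code). mclapply() relies on
# forking, so on Windows mc.cores must be 1 and the blocks run sequentially.
piecewise_crossprod_par <- function(x, nseg = 4L, cores = 2L) {
  n <- ncol(x)
  blocks <- split(seq_len(n), ceiling(seq_len(n) * nseg / n))
  pieces <- mclapply(blocks, function(cols) {
    # Each worker computes one n x length(cols) slice of the cross product
    s <- summary(crossprod(x, x[, cols, drop = FALSE]))
    data.table(ID1 = s$i, ID2 = cols[s$j], v = s$x)
  }, mc.cores = cores)
  rbindlist(pieces)
}

# Small binary example: 5 terms x 3 documents (not normalized here)
x <- sparseMatrix(
  i = c(1, 2, 5, 2, 4, 3, 5, 4),
  j = c(1, 1, 1, 2, 2, 3, 3, 3),
  x = 1, dims = c(5, 3)
)
res <- piecewise_crossprod_par(x, nseg = 3L, cores = 2L)
```

Each worker holds its own block result, so memory use grows with the number of cores; with limited RAM it may be worth keeping cores small and nseg large.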