My question is straightforward. I have a (binary) TDM and I want to reduce it to only those terms (rows) that appear in at least two documents.
I thought that these two methods would produce the same result in a binary matrix:
> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity : 100%
Maximal term length: 154
Weighting : binary (bin)
> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity : 100%
Maximal term length: 308
Weighting : binary (bin)
Yet they do not. Could you help me figure out why?
They do produce the exact same result; the difference comes from your selection criteria, which are not the same. In the first part, rowTotals > 2 keeps all terms with a frequency of 3 or more, while in the second part, findFreqTerms(tdm, lowfreq = 2) keeps all terms with a frequency of 2 or more. Since you want terms that appear in at least two documents, the first part should use rowTotals >= 2. If you make sure both selection criteria are the same, you will see that they produce the same result.
Are they the same?
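A minimal sketch on a toy corpus, to check this on something small. The three documents and the variable names are made up for illustration and are not your data; row_sums is called through the slam namespace because tm imports it without attaching it.

```r
library(tm)

# Toy corpus (hypothetical): "apple" occurs in 3 documents,
# "banana" in 2, "cherry" and "date" in 1 each.
docs <- c("apple banana cherry", "apple banana", "apple date")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)),
                          control = list(weighting = weightBin))

# With binary weights, a row sum is the number of documents containing the term.
rowTotals <- slam::row_sums(tdm)

by_sums_gt2 <- tdm[which(rowTotals > 2), ]                # 3 or more documents
by_sums_ge2 <- tdm[which(rowTotals >= 2), ]               # 2 or more documents
by_find     <- tdm[findFreqTerms(tdm, lowfreq = 2), ]     # 2 or more documents

# Same threshold: the selected terms match.
identical(Terms(by_sums_ge2), Terms(by_find))   # TRUE

# Stricter threshold: fewer terms survive.
Terms(by_sums_gt2)                              # "apple"
```

With > 2 only "apple" is kept, while >= 2 and lowfreq = 2 both keep "apple" and "banana", which is the same kind of gap you see between 208361 and 340436 terms.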
Speed:
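A rough way to time both selections, sketched here with base system.time on the crude example corpus that ships with tm. The repetition count n is arbitrary, and the timings will vary by machine and matrix size, so treat the numbers as indicative only.

```r
library(tm)

data("crude")  # 20-document example corpus shipped with tm
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightBin))

n <- 200  # number of repetitions (arbitrary choice)

# Selection via row sums
t_sums <- system.time(for (i in seq_len(n)) {
  rs <- slam::row_sums(tdm)
  a <- tdm[which(rs >= 2), ]
})

# Selection via findFreqTerms
t_find <- system.time(for (i in seq_len(n)) {
  b <- tdm[findFreqTerms(tdm, lowfreq = 2), ]
})

# Elapsed seconds for each approach
c(row_sums = t_sums[["elapsed"]], findFreqTerms = t_find[["elapsed"]])

# Both selections keep the same terms
identical(Terms(a), Terms(b))
```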
The selection via row_sums is slightly faster. That is because findFreqTerms itself uses row_sums to get the term frequencies, and it adds a few extra lines of code to check that you passed it a document-term matrix (or term-document matrix) and that the frequencies you request are actual numbers.
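To make the overhead concrete, here is a sketch of roughly what such a function does. The name find_freq_terms_sketch is hypothetical and this is an approximation for illustration, not the package source; you can print tm::findFreqTerms at the console to see the real definition.

```r
library(tm)

# Hypothetical re-implementation, for illustration only.
find_freq_terms_sketch <- function(x, lowfreq = 0, highfreq = Inf) {
  # Input checks: the extra work compared with calling row_sums directly
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
            is.numeric(lowfreq), is.numeric(highfreq))
  # Put terms on the rows if a document-term matrix was passed
  if (inherits(x, "DocumentTermMatrix")) x <- t(x)
  rs <- slam::row_sums(x)
  # Note the inclusive bounds: lowfreq = 2 means "2 or more"
  names(rs[rs >= lowfreq & rs <= highfreq])
}

# Toy data (made up) to compare against the real function
docs <- c("apple banana cherry", "apple banana", "apple date")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)),
                          control = list(weighting = weightBin))

identical(find_freq_terms_sketch(tdm, lowfreq = 2),
          findFreqTerms(tdm, lowfreq = 2))
```

The inclusive lower bound is the detail that matters for your question: lowfreq = 2 corresponds to rowTotals >= 2, not rowTotals > 2.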