I am trying to compute tf-idf on a matrix. I would like to use gensim, but models.TfidfModel()
only works on a corpus, so it returns a list of lists of varying lengths (I want a matrix).
The options are to somehow fill in the missing values of the list of lists, or to convert the corpus to a matrix:
    numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
Choosing the latter, I then try to convert this count matrix to a tf-idf weighted matrix:
    def TFIDF(m):
        #import numpy
        WordsPerDoc = numpy.sum(m, axis=0)
        DocsPerWord = numpy.sum(numpy.asarray(m > 0, 'i'), axis=1)
        rows, cols = m.shape
        for i in range(rows):
            for j in range(cols):
                amatrix[i,j] = (amatrix[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
But I get the error: AttributeError: 'numpy.ndarray' object has no attribute 'A'
I copied the function above from another script. It was:
    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
I believe that is where the A is coming from. However, I re-imported the function.
Why is this happening?
self.A implies that self is either an np.matrix or a sparse matrix. For both, .A means: return the data as an np.ndarray. In other words, it converts the 2d matrix to a regular numpy array. If self is already an array, it would produce your error.
It looks like you have corrected that with your own version of TFIDF, except that it uses 2 variables, m and amatrix, instead of self.A.
I think you need to look more closely at the error message and stack trace to identify where that .A is. Also make sure you understand where the code expects a matrix, especially a sparse one, and whether your own code differs in that regard.
I recall from other SO questions that one of the learning packages had switched to using sparse matrices, and that required adding .todense() to some of their code (which expected dense ones).
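
To make the point about .A concrete, here is a minimal sketch (not the original script's method) of a tf-idf function that converts its input to a plain ndarray up front, so it never needs .A. The names to_ndarray and tfidf are made up for illustration. It assumes the count matrix is laid out as terms x documents, which is what the original loops imply, and it uses the natural log:

```python
import numpy as np

def to_ndarray(m):
    """Return m as a 2-D np.ndarray (hypothetical helper).

    Accepts an ndarray, an np.matrix, or a SciPy sparse matrix.
    """
    if hasattr(m, "toarray"):      # SciPy sparse matrices provide toarray()
        return m.toarray()
    return np.asarray(m)           # no-op for ndarray, converts np.matrix

def tfidf(counts):
    """tf-idf weights for a (terms x documents) count matrix."""
    a = to_ndarray(counts).astype(float)
    words_per_doc = a.sum(axis=0)          # one total per column (document)
    docs_per_word = (a > 0).sum(axis=1)    # one count per row (term)
    n_docs = a.shape[1]
    # Replace zeros in the denominators by 1 to avoid division by zero;
    # the affected rows/columns are all zero, so their weights stay 0.
    tf = a / np.where(words_per_doc > 0, words_per_doc, 1)
    idf = np.log(n_docs / np.maximum(docs_per_word, 1))
    return tf * idf[:, np.newaxis]         # broadcast idf across columns

counts = np.array([[1, 0],
                   [1, 2]])
weights = tfidf(counts)

# A plain ndarray has no .A attribute, which is what the error says;
# an np.matrix does, and both give the same result here.
print(hasattr(counts, "A"), hasattr(np.asmatrix(counts), "A"))
print(np.allclose(weights, tfidf(np.asmatrix(counts))))
```

Because the conversion happens once at the top, the same function can be called on the output of corpus2dense or on a matrix type, and the double loop is replaced by broadcasting.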