MALLET Java API Importing Data

I am trying to do topic modeling with the Java API. There is a handy example provided with the package. However, my data is much larger than the example's, so I think it would be impractical to import it all from one file.

I looked at the PowerPoint presentation linked in another MALLET question and found something called a FileIterator, which I believe I should be able to use in place of the CsvIterator used in their example Java code. However, I'm not sure I'm using it correctly. When I tried running my code, it got stuck for an impractically long time on the line that simply creates the FileIterator. I haven't yet delved into the MALLET source to dissect the issue; I figured someone else might already know more about it. Can I just pass it a directory that contains several subdirectories in which the documents themselves are stored?

And then there's also a chance I'm just giving it too much data at once.

So my overall question really has two parts:

1) At how large a scale can MALLET function? I have ~500,000 six-line documents I'd like to assign topics to. Is this feasible with MALLET in the first place?

2) If it is feasible, what is the best way to import this data into MALLET? If it's not feasible with MALLET, what else could I use?


EDIT: I was indeed able to use the FileIterator, but not in the way I had suspected. The simplest way to do what I was trying to do is to put all the individual files, each containing one instance, into a single directory. I can then feed that directory to FileIterator, and it works the way CsvIterator does.
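
For reference, here is a minimal sketch of that import path, assuming a flat directory of plain-text files with one document per file. The class name, pipe stages, and tokenization pattern are illustrative choices of mine, not anything mandated by MALLET:

    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.Input2CharSequence;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.TokenSequenceLowercase;
    import cc.mallet.pipe.iterator.FileIterator;
    import cc.mallet.types.InstanceList;

    import java.io.File;
    import java.io.FileFilter;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class DirectoryImport {

        public static InstanceList readDirectory(File dir) {
            // Pipe: raw file -> char sequence -> tokens -> lowercase -> feature sequence.
            ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
            pipeList.add(new Input2CharSequence("UTF-8"));
            pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipeList.add(new TokenSequenceLowercase());
            pipeList.add(new TokenSequence2FeatureSequence());

            // Accept every regular file in the directory; LAST_DIRECTORY labels
            // each instance with the name of its enclosing directory.
            FileIterator iterator = new FileIterator(
                    new File[] { dir },
                    new FileFilter() {
                        public boolean accept(File f) { return f.isFile(); }
                    },
                    FileIterator.LAST_DIRECTORY);

            InstanceList instances = new InstanceList(new SerialPipes(pipeList));
            instances.addThruPipe(iterator);
            return instances;
        }
    }

FileIterator.LAST_DIRECTORY uses the enclosing directory name as each instance's label; with a single flat directory every document gets the same label, which is harmless for topic modeling.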

As for scalability, I was able to run about 10,000 of the short documents in a reasonable amount of time, but since LDA considers all the documents simultaneously, I don't think running all of them at once is feasible. However, MALLET's TopicInferencer class lets me fit the model on as many documents as I reasonably can and then infer topics for the rest. That was good enough for my needs.
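
Here is a minimal sketch of that train-then-infer pattern, assuming the held-out documents were imported through the same pipe as the training set (so the alphabets line up). The topic count and sampling parameters below are placeholders, not tuned values:

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicInferencer;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;

    import java.util.Arrays;

    public class TrainThenInfer {

        // Train LDA on a manageable subset, then infer topic distributions
        // for held-out documents that were not part of training.
        public static void run(InstanceList training, InstanceList heldOut)
                throws Exception {
            int numTopics = 50;                        // placeholder value
            ParallelTopicModel model =
                    new ParallelTopicModel(numTopics, 1.0, 0.01);
            model.addInstances(training);
            model.setNumThreads(4);
            model.setNumIterations(1000);              // placeholder value
            model.estimate();

            TopicInferencer inferencer = model.getInferencer();
            for (Instance doc : heldOut) {
                // 100 sampling iterations, thinning 10, burn-in 10 (placeholders).
                double[] topicProbs =
                        inferencer.getSampledDistribution(doc, 100, 10, 10);
                // topicProbs[k] is the inferred proportion of topic k in doc.
                System.out.println(Arrays.toString(topicProbs));
            }
        }
    }

The main design constraint is that heldOut must be built with the same Pipe object as training; otherwise the feature indices seen by the inferencer won't match the new documents.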

There are 2 answers

London guy (Best Answer)

Did you try reducing the corpus size and then running the topic modelling to see how long the processing takes?

Also, here you may find some performance numbers for MALLET topic modelling that someone reported when measuring it against their product:

http://www.slideshare.net/wadkarsameer/large-scale-topic-modeling

user1732578

I am Sameer Wadkar, the author of http://www.slideshare.net/wadkarsameer/large-scale-topic-modeling. With my modified MALLET version of LDA, I was able to scale up to 2.5 million documents. I have a cleaner version of it here:

https://github.com/sameeraxiomine/largelda

Send me an email if you want to use it. I was planning to write up user instructions at some point, but I have not gotten around to it yet.