LSA Similarity interface

689 views Asked by At

I am a PhD student in translation studies and I am currently working on my dissertation. I am using LSA Similarity interface as a method of analysis in my dissertation. My background is in linguistics and not computer science. I tried to find an easy LSA document categorisation tool but I could not find any. I tried to play with Gensim, I did not work. I think my problem is with linking my corpus (txt files) with the Gensim tool to do the analysis (I don't know how o do this step). I would greatly appreciate if anyone could help me with the analysis or direct me to any tool or easy tutorials to do it using Gensim.

I want to do the following: I want to apply document-doecument queries to retrieve the most relevant 5 documents from the corpus to the query document.

  1. I have 15 query document
  2. I have one corpus of (150 texts)The texts are short stories

I am desperate and I was hesitant to post this question here. I am sure that applying LSA in translation studies would add to the field and this makes me more persistent to find a way to do my analysis.

1

There are 1 answers

0
Gabriel On

The only really easy, user-friendly tool for LSA that is out there right now is http://lsa.colorado.edu/ . Unfortunately, it is a web-based tool only, and it does not allow you to train LSA on your own corpora. But depending on your needs, that may not matter.

If I'm understanding you correctly, you need document-document similarity scores between each of 15 query documents and each of 150 short stories (a total of 15*150=2250 similarity scores). If these query documents and short stories are in English, then you can use the version of LSA that is trained on the TASA corpus used in many studies of LSA as follows:

  • Go to http://lsa.colorado.edu/
  • Select One-To-Many Comparison
  • Copy-paste one of the short stories in the "Main text" box, and the 15 queries separated with a blank line in the "Texts to compare" box
  • Repeat for each of your short stories. A huge pain? Yes. But if you are desperate...

If you program a little bit in Python or R, other tools for LSA include http://clic.cimec.unitn.it/composes/toolkit/introduction.html and http://cran.r-project.org/web/packages/lsa/lsa.pdf , and would save you the manual labor of the above suggestion. Also, I know you already tried Gensim, but there is a nice tutorial for it at http://radimrehurek.com/gensim/tutorial.html that you might try following if you haven't already.