I have a body of text, 500 sentences. Sentences are clearly delineated, let's assume by a period for simplicity's sake. Each sentence has about 10-20 words.
I want to break it down into groups of words that statistically are used in the same sentence most often. Here's a simple example.
This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champagne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.
I do have a list of stopwords that get filtered out. In the case above, I could imagine wanting to build these groups:
group 1: pink cats madonna
group 2: whales drink when
Or something like that. I realize this can be quite a complicated endeavor. I've been experimenting with TF-IDF similarity and haven't really gotten anywhere yet. I'm working in Ruby and would love to hear any thoughts/directions/suggestions people might have.
I liked the puzzle and here is my take on a possible solution*...
* Although I would recommend that next time you don't just throw out your question without showing what you tried and where you're stuck. Otherwise, it might seem that you're throwing your class homework at us.
Let's assume this is our text:
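Presumably this is the set of four example sentences from the question. As a small sketch, they could be held in a single string; the variable name `text` is an assumption.

```ruby
# The four example sentences from the question, one per line.
text = <<~TEXT
  This is a sentence about pink killer cats chasing madonna.
  Sometimes when whales fight bricklayers, everyone drinks champagne.
  You know Madonna has little cats on her slippers.
  When whales drink whiskey, your golf game is over.
TEXT

puts text
```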
It seems to me that there are a number of stages to the task at hand:
Create a "words" catalog.
Count how many times each word appears in the text.
Remove any word that appears only once - it's irrelevant.
Split the text into lowercase sentences.
Make a `sentences => relevant_words_used` Hash.
Fill in the sentences Hash with the words used in each sentence. The following is a simplified way to do this (in an actual application you would need to separate words, to make sure 'cat' and 'caterpillar' don't overlap):
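A minimal sketch of that simplified approach, not necessarily the exact code originally posted. It assumes the example sentences from the question, and the `stopwords` list is hypothetical because the question mentions one without showing it. Words are counted, singletons and stopwords are dropped, and each sentence is mapped to the remaining words it contains using a plain substring test.

```ruby
# Hypothetical stopword list; the real one is not shown in the question.
stopwords = %w[a about has her is on this you your]

text = "This is a sentence about pink killer cats chasing madonna. " \
       "Sometimes when whales fight bricklayers, everyone drinks champagne. " \
       "You know Madonna has little cats on her slippers. " \
       "When whales drink whiskey, your golf game is over."

# Lowercase a copy so "Madonna" and "madonna" count as the same word.
text = text.downcase

# Count every word (ASCII letters and apostrophes only, for simplicity).
counts = Hash.new(0)
text.scan(/[a-z']+/) { |word| counts[word] += 1 }

# Keep words that occur more than once and are not stopwords.
relevant = counts.select { |word, n| n > 1 && !stopwords.include?(word) }.keys

# Map each sentence to the relevant words it uses.
# String#include? is a substring test, so "cat" would also match "caterpillar".
sentences = {}
text.split(".").map(&:strip).reject(&:empty?).each do |sentence|
  sentences[sentence] = relevant.select { |word| sentence.include?(word) }
end

sentences.each { |sentence, words| puts "#{words.inspect} <= #{sentence}" }
```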
An example of the more complex version would be:
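One way to avoid partial-word matches is to test each word with a regular expression anchored at word boundaries. This is only a sketch: the sample text and the `relevant` list are made up to trigger the problem, and `relevant` stands in for the output of the word-counting stage.

```ruby
# Hypothetical sample chosen to trigger the partial-word problem.
text = "The cat watched a caterpillar. My dogma never changes."
relevant = %w[cat dog] # assumed result of the word-counting stage

sentences = {}
text.downcase.split(".").map(&:strip).reject(&:empty?).each do |sentence|
  # \b matches only at word boundaries, so "dog" no longer matches "dogma".
  sentences[sentence] = relevant.select do |word|
    sentence.match?(/\b#{Regexp.escape(word)}\b/)
  end
end

sentences.each { |sentence, words| puts "#{words.inspect} <= #{sentence}" }
```

`Regexp.escape` keeps words containing regex metacharacters from being interpreted as patterns.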
Your groups are in the `sentences.values` Array. Now it's time to find common groups and count how many times they repeat. Voila, these are the common groups:
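As a sketch of the counting step, the `sentences` Hash below is written out by hand to stand in for the result of the previous stage. Each group is sorted so that word order does not matter, identical groups are tallied, and only groups shared by at least two sentences are kept. The names `group_counts` and `common_groups` are assumptions.

```ruby
# Stand-in for the previous stage: sentence => relevant words used in it.
sentences = {
  "this is a sentence about pink killer cats chasing madonna" => %w[cats madonna],
  "sometimes when whales fight bricklayers, everyone drinks champagne" => %w[when whales],
  "you know madonna has little cats on her slippers" => %w[madonna cats],
  "when whales drink whiskey, your golf game is over" => %w[when whales]
}

# Sort each group so word order is irrelevant, then count identical groups.
# Only groups of two or more words are counted.
group_counts = Hash.new(0)
sentences.each_value { |words| group_counts[words.sort] += 1 if words.size > 1 }

# Keep groups shared by at least two sentences, most frequent first.
common_groups = group_counts.select { |_, n| n > 1 }.sort_by { |_, n| -n }

common_groups.each { |words, n| puts "#{words.join(' ')} (#{n} sentences)" }
```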
The whole of the code might look like so:
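The stages can be combined into a single method, sketched here under the same assumptions: the method name `common_word_groups` and the stopword list are hypothetical, only ASCII letters are treated as word characters, and whole words are matched rather than substrings.

```ruby
# Returns { [word, ...] => sentence_count } for groups shared by 2+ sentences.
# `stopwords` is a hypothetical list; the question does not show the real one.
def common_word_groups(text, stopwords = [])
  text = text.downcase # non-destructive: the caller's string is left unchanged

  # Count every word, then keep non-stopwords that occur more than once.
  counts = Hash.new(0)
  text.scan(/[a-z']+/) { |word| counts[word] += 1 }
  relevant = counts.select { |word, n| n > 1 && !stopwords.include?(word) }.keys

  # Map each sentence to the relevant whole words it contains.
  sentences = {}
  text.split(/[.?!]+/).map(&:strip).reject(&:empty?).each do |sentence|
    sentences[sentence] = sentence.scan(/[a-z']+/) & relevant
  end

  # Count identical, order-independent groups of two or more words.
  groups = Hash.new(0)
  sentences.each_value { |words| groups[words.sort] += 1 if words.size > 1 }
  groups.select { |_, n| n > 1 }
end

text = "This is a sentence about pink killer cats chasing madonna. " \
       "Sometimes when whales fight bricklayers, everyone drinks champagne. " \
       "You know Madonna has little cats on her slippers. " \
       "When whales drink whiskey, your golf game is over."

common_word_groups(text, %w[a is]).each do |words, n|
  puts "#{words.join(' ')}: #{n}"
end
```

For the four example sentences this yields two groups, cats/madonna and whales/when, each shared by two sentences.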
EDIT
I corrected an issue with the code where the text wasn't kept lowercase (`text.downcase!` vs. `text.downcase`).
EDIT2
I reviewed the partial-word issue (i.e. `cat` vs. `caterpillar` or `dog` vs. `dogma`).