How can I group words based on how often they are used in the same sentence?


I have a body of text, 500 sentences. Sentences are clearly delineated; let's assume by a period, for simplicity's sake. Each sentence has about 10-20 words.

I want to break it down into groups of words that statistically are used in the same sentence most often. Here's a simple example.

This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.

I do have a list of stopwords that get filtered out. In the case above, I could imagine wanting to build these groups:

group 1: pink cats madonna
group 2: whales drink when

Or something like that. I realize this can be quite a complicated endeavor. I've been experimenting with TF-IDF similarity and haven't really gotten anywhere yet. I'm working in Ruby and would love to hear any thoughts, directions, or suggestions people might have.

There is 1 answer

Myst On

I liked the puzzle and here is my take on a possible solution*...

* Although I would recommend that next time you don't just post your question without showing what you tried and where you're stuck. Otherwise, it might seem that you're throwing your class homework at us...

Let's assume this is our text:

text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

It seems to me that the task at hand has a number of stages:

  1. Create a "words" catalog.

  2. Count how many times each word appears in the text.

    require 'strscan'
    words = Hash.new(0) # word => number of occurrences
    scn = StringScanner.new(text.downcase)
    until scn.eos?
      scn.skip(/\W+/)        # skip punctuation and whitespace
      word = scn.scan(/\w+/) # nil when no word follows
      words[word] += 1 if word
    end
    
  3. Remove any word that appears only once - it's irrelevant.

    words.delete_if {|w, v| v <= 1}
    
  4. Split the text into lowercase sentences.

  5. Make a sentences => relevant_words_used Hash.

    sentences = {}
    text.downcase.split(/\.[\s]*/).each {|s| sentences[s] = []}
    
  6. Fill in the sentences Hash with the words used in each sentence. The following is a simplified way to do this (in an actual application you would need to separate words, to make sure 'cat' and 'caterpillar' don't overlap):

    words.each {|w, c| sentences.each {|s, v| v << w if s.include? w} }
    

    An example of the more robust version, which compares whole words, would be:

    sentences.each {|s, v| tmp = s.split(/[^\w]+/); words.each {|w, c| v << w if tmp.include? w} }
    
  7. Your groups are in the sentences.values Array. Now it's time to find common groups and count how many times they repeat.

    common_groups = {}
    tmp_groups = sentences.values
    until tmp_groups.empty?
       active_group = tmp_groups.pop
       tmp_groups.each do |g|
            common = active_group & g
            next if common.empty?
            common_groups[common] = [2,(common_groups[common].to_i + 1)].max
       end
    end
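
One caveat about step 7: the counter grows once per pair of sentences, so a group shared by three sentences would be reported as 4 (2, then 3, then 4). A minimal sketch of one way around that is to remember which sentences produced each group and count those instead. The `group_sentence_counts` helper and its sample data are hypothetical and not part of the code above; it counts only the sentences whose pairwise intersection equals the group.

```ruby
require 'set'

# Hypothetical helper: maps each common word group to the number of
# distinct sentences that took part in producing it.
def group_sentence_counts(groups)
  members = Hash.new { |h, k| h[k] = Set.new }
  groups.each_with_index.to_a.combination(2) do |(a, i), (b, j)|
    common = a & b
    next if common.empty?
    members[common] << i << j # remember both sentence indices
  end
  members.transform_values(&:size)
end

# Sample data (an assumption): three sentences share "when" and "whales".
groups = [%w[when whales], %w[when whales drink], %w[cats], %w[when whales]]
counts = group_sentence_counts(groups)
counts.each { |g, c| puts "the word(s) #{g} were common to #{c} sentences." }
```

Here `groups` plays the role of `sentences.values`, so the helper could be fed that Array directly.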
    

Voilà, these are the common groups:

common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}

# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.
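
The `["is"]` group shows up because nothing has removed stopwords yet. Since the question mentions an existing stopword list, one option is to drop those words from the catalog before filling the sentences Hash. A rough sketch, where `stopwords` is a made-up stand-in for that list and `words` holds the counts the sample text produces after step 3:

```ruby
require 'set'

# Hypothetical stopword list; substitute your own.
stopwords = Set.new(%w[a about is on the this you your])

# Word counts as left by steps 1-3 for the sample text.
words = { 'is' => 2, 'cats' => 2, 'madonna' => 2, 'when' => 2, 'whales' => 2 }

# Drop every stopword from the catalog before building the groups.
words.delete_if { |w, _count| stopwords.include?(w) }

p words.keys # => ["cats", "madonna", "when", "whales"]
```

With "is" gone from the catalog, only the `["when", "whales"]` and `["cats", "madonna"]` groups remain.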

Put together, the whole code might look like this:

text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

require 'strscan'
text.downcase!
words = Hash.new(0) # word => number of occurrences
scn = StringScanner.new(text)

until scn.eos?
  scn.skip(/\W+/)        # skip punctuation and whitespace
  word = scn.scan(/\w+/) # nil when no word follows
  words[word] += 1 if word
end

words.delete_if {|w, v| v <= 1}

sentences = {}
text.split(/\.[\s]*/).each {|s| sentences[s] = []}

# Splitting each sentence into words avoids partial matches
# (cat vs. caterpillar):
sentences.each {|s, v| tmp = s.split(/[^\w]+/); words.each {|w, c| v << w if tmp.include? w} }
# The simplified substring version would be:
# words.each {|w, c| sentences.each {|s, v| v << w if s.include? w} }

common_groups = {}
tmp_groups = sentences.values
until tmp_groups.empty?
   active_group = tmp_groups.pop
   tmp_groups.each do |g|
        common = active_group & g
        next if common.empty?
        common_groups[common] = [2,(common_groups[common].to_i + 1)].max
   end
end

common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}

# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.
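
For a larger corpus (the question mentions 500 sentences), a complementary approach is to count how often each pair of words shares a sentence and then rank the pairs. This is a different technique from the set intersection above, sketched here with `Array#combination`. The `pair_counts` and `frequent` names are illustrative assumptions, and the sample text is the question's, already lowercased.

```ruby
# Sample text from the question, already lowercased.
text = 'this is a sentence about pink killer cats chasing madonna. ' \
       'sometimes when whales fight bricklayers, everyone drinks champaigne. ' \
       'you know madonna has little cats on her slippers. ' \
       'when whales drink whiskey, your golf game is over.'

pair_counts = Hash.new(0)
text.split(/\.\s*/).each do |sentence|
  # uniq: a word repeated within one sentence should count only once;
  # sort: makes ["cats", "madonna"] and ["madonna", "cats"] the same key.
  sentence.scan(/\w+/).uniq.sort.combination(2) { |pair| pair_counts[pair] += 1 }
end

# Keep only the pairs seen in more than one sentence, most frequent first.
frequent = pair_counts.select { |_pair, n| n > 1 }.sort_by { |_pair, n| -n }
frequent.each { |pair, n| puts "#{pair.join(' + ')}: #{n} sentences" }
```

Pairs that score highly can then be merged into larger groups, and filtering stopwords first keeps the number of pairs manageable.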

EDIT

I corrected an issue in the code where the text was not kept in lowercase (text.downcase! vs. text.downcase).

EDIT2

I reviewed the problem of partial word matches (e.g. cat vs. caterpillar or dog vs. dogma).
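
To illustrate the partial-word problem: `String#include?` matches substrings, whereas checking against the sentence's word list (or using a `\b` word-boundary regex) matches whole words only. A small sketch with a made-up sentence:

```ruby
sentence = 'the caterpillar ignored the dogma'

# Substring test: reports "cat" even though only "caterpillar" is present.
puts sentence.include?('cat')       # true

# Whole-word test via the sentence's word list.
tokens = sentence.split(/\W+/)
puts tokens.include?('cat')         # false
puts tokens.include?('caterpillar') # true

# Equivalent whole-word test with a word-boundary regex.
puts sentence.match?(/\b#{Regexp.escape('dog')}\b/) # false
```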