collections.Counter() is counting letters instead of words


I have to find the most frequently occurring words in the column df['messages'] of a dataframe. The column has many rows, so I lowercased them and joined them all into a single string (words joined by spaces), stored in one variable, all_words. But when I try to count the most common words, it shows me the most used letters instead. My data has this form:

0    abc de fghi klm
1    qwe sd fd s dsdd sswd??
3    ded fsf sfsdc wfecew wcw.

Here is a snippet of my code:

    from collections import Counter
    all_words = ' '
    for msg in df['messages'].values:
        words = str(msg).lower()
        all_words = all_words + str(words) + ' '
            
    count = Counter(all_words)
    count.most_common(3)

And here is its output:

[(' ', 5260), ('a', 2919), ('h', 1557)]

I also tried df['messages'].value_counts(), but it returns the most frequent rows (whole sentences) instead of words, like:

asad adas asda     10
asaa as awe        3
wedxew dqwed       1

Please tell me where I am wrong or suggest any other method that might work.


There are 2 answers

1
Lucas On BEST ANSWER

Counter iterates over whatever you pass to it. If you pass it a string, it iterates over the string's characters (and that's what it will count). If you pass it a list where each element is a word, it will count words.

from collections import Counter

text = "spam and more spam"

c = Counter()
c.update(text)  # text is a str, count chars
c
# Counter({'s': 2, 'p': 2, 'a': 3, 'm': 3, [...], 'e': 1})

c = Counter()
c.update(text.split())  # now it is a list: ['spam', 'and', 'more', 'spam']
c
# Counter({'spam': 2, 'and': 1, 'more': 1})

So, you should do something like this:

from collections import Counter

all_words = []
for msg in df['messages'].values:
    words = str(msg).lower().split()  # split each message into a list of words
    all_words.extend(words)           # add the individual words, not the whole message

count = Counter(all_words)
count.most_common(3)

# the same, but with a generator expression
count = Counter(word for msg in df['messages'].values for word in str(msg).lower().split())
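
Note that the sample data in the question also contains punctuation ("sswd??", "wcw."), which splitting on whitespace alone would leave attached to the words. Here is a minimal sketch of one way to handle that. It assumes a plain list stands in for df['messages'].values, and count_words is a hypothetical helper name, not something from the question. The fourth message is made up so that some words repeat.

```python
import re
from collections import Counter

# Hypothetical stand-in for df['messages'].values; the last row is invented
messages = [
    "abc de fghi klm",
    "qwe sd fd s dsdd sswd??",
    "ded fsf sfsdc wfecew wcw.",
    "abc de",
]

def count_words(messages):
    """Count words across all messages, ignoring case and punctuation."""
    counts = Counter()
    for msg in messages:
        # \w+ matches runs of letters, digits and underscores,
        # so "sswd??" contributes "sswd" and "wcw." contributes "wcw"
        counts.update(re.findall(r"\w+", str(msg).lower()))
    return counts

word_counts = count_words(messages)
print(word_counts.most_common(3))
```

Whether \w+ is the right definition of a "word" depends on your data. It would split "don't" into "don" and "t", for example, so adjust the pattern if that matters.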
0
XtianP On
from collections import Counter
all_words = []
for msg in df['messages'].values:
    words = str(msg).lower().split()  # split() with no argument ignores extra whitespace
    all_words.extend(words)
            
count = Counter(all_words)
count.most_common(3)
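
One detail to watch when splitting messages into words: split(' ') and split() behave differently when a message contains repeated spaces. The sketch below uses a made-up message (not from the question's data) to show the difference.

```python
from collections import Counter

# Hypothetical message with repeated and trailing spaces
msg = "spam  and more   spam "

# split(' ') cuts at every single space, so empty strings show up as "words"
by_space = msg.split(' ')

# split() with no argument cuts at runs of whitespace and drops empty strings
by_whitespace = msg.split()

print(Counter(by_space).most_common(1))
print(Counter(by_whitespace).most_common(1))
```

With split(' '), the empty string can end up as the most common "word", so split() with no argument is usually the safer choice for free-form text.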