python pandas get rid of plural "s" in words to prepare for word count


I have the following python pandas dataframe:

Question_ID | Customer_ID | Answer
    1           234         The team worked very hard ...
    2           234         All the teams have been working together ...

I am going to count the words in the Answer column. Beforehand, I want to strip the "s" from the word "teams", so that in the example above I count team: 2 instead of team: 1 and teams: 1.

How can I do this for all words?

3

There are 3 answers

0
DYZ On BEST ANSWER

You need a tokenizer (to break a sentence into words) and a lemmatizer (to standardize word forms), both provided by the Natural Language Toolkit, nltk:

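import nltk
nltk.download('wordnet')  # one-time download of the WordNet data used by the lemmatizer
sentence = "All the teams have been working together"
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(word) for word in nltk.wordpunct_tokenize(sentence)]
# ['All', 'the', 'team', 'have', 'been', 'working', 'together']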
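
To carry this over to the whole Answer column and get the counts the question asks for, one possible sketch is below. The helper name `count_normalized` and the simple regex tokenizer are assumptions made for illustration and are not part of nltk. The normalizer is passed in as a function, so `wnl.lemmatize` from above can be plugged in.

```python
import re
from collections import Counter

def count_normalized(answers, normalize):
    """Count words across all answers after normalizing each token.

    `normalize` is any function mapping a word to its base form,
    for example nltk.WordNetLemmatizer().lemmatize.
    """
    counts = Counter()
    for text in answers:
        # Lowercase first so "Team" and "team" are counted together.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(normalize(tok) for tok in tokens)
    return counts
```

With the DataFrame from the question, this could be called as `count_normalized(df.Answer, wnl.lemmatize)`. Given the lemmatizer output shown above, "team" and "teams" should then end up under the single key `team`.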
2
piRSquared On

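Use str.replace to remove the trailing 's' from any word of three or more letters that ends in 's'. In recent pandas versions you must pass regex=True, because the pattern is otherwise treated as a literal string.

df.Answer.str.replace(r'(\w{2,})s\b', r'\1', regex=True)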

0                  The team worked very hard ...
1    All the team have been working together ...
Name: Answer, dtype: object

'{2,}' means two or more word characters. Combined with the trailing 's', it ensures that 'is' is left alone. You can set it to '{3,}' to skip 'its' as well.
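
The effect of the quantifier can be checked outside pandas with Python's `re` module, which uses the same pattern syntax. This is only a small sketch; the helper name `strip_plural_s` and its `min_stem` parameter are hypothetical.

```python
import re

def strip_plural_s(text, min_stem=2):
    """Drop a trailing 's' from words with at least `min_stem` characters before it."""
    pattern = r"(\w{%d,})s\b" % min_stem
    return re.sub(pattern, r"\1", text)

print(strip_plural_s("is its teams"))              # is it team
print(strip_plural_s("is its teams", min_stem=3))  # is its team
```

Note that a pattern like this cannot tell plurals from other words ending in 's', so a word such as "class" would be trimmed as well.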

0
Little Bobby Tables On

Try NLTK, specifically stemming and lemmatization. I have never personally used it, but you can try it out here.

Here is an example of some tricky plurals,

its it's his quizzes fishes maths mathematics

becomes

it it ' s hi quizz fish math mathemat

You can see it deals with "his" (and "mathematics") poorly, but then again you could have lots of abbreviated "hellos". This is the nature of the beast.
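
One way to soften this trade-off is to keep a small list of words that must never be trimmed. The sketch below is an assumption layered on the regex approach from the other answer, not something NLTK does; the `KEEP` set and the function name `singularize` are hypothetical, and the list would need tuning for real data.

```python
import re

# Hypothetical exception list: words ending in 's' that are not plurals.
KEEP = {"his", "its", "this", "was", "has", "mathematics"}

def singularize(text):
    """Strip a trailing 's' from words of 3+ letters, except those in KEEP."""
    def trim(match):
        word = match.group(0)
        return word if word.lower() in KEEP else word[:-1]
    # \b\w{2,}s\b matches whole words that end in 's'.
    return re.sub(r"\b\w{2,}s\b", trim, text)
```

For example, `singularize("his teams")` leaves "his" untouched and trims only "teams".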