Find the number of bigrams after filtering out stop words

Case study

Task 1

  • Import text corpus brown

  • Extract the list of words associated with text collections belonging to the news genre. Store the result in the variable news_words.

  • Convert each word of the list news_words into lower case, and store the result in lc_news_words.

  • Compute the length of each word present in the list lc_news_words, and store the result in the list len_news_words.

  • Compute bigrams of the list len_news_words. Store the result in the variable news_len_bigrams.

  • Compute the conditional frequency of news_len_bigrams, where both the condition and the event refer to word lengths. Store the result in cfd_news.

  • Determine the frequency of 6-letter words appearing next to a 4-letter word.
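To make the last two steps concrete, here is a minimal plain-Python sketch of what a conditional frequency over word-length bigrams computes. It uses a made-up toy sentence instead of the Brown corpus, and `length_bigram_cfd` is a hypothetical helper that only mimics the lookup style of `nltk.ConditionalFreqDist` (`cfd[condition][event]`); it is not NLTK's implementation.

```python
from collections import Counter, defaultdict


def length_bigram_cfd(words):
    """Map each word length to a Counter of the word lengths that follow it."""
    lengths = [len(w.lower()) for w in words]
    cfd = defaultdict(Counter)
    # zip(lengths, lengths[1:]) yields consecutive pairs, much like nltk.bigrams
    for first, second in zip(lengths, lengths[1:]):
        cfd[first][second] += 1
    return cfd


# Toy data (an assumption for illustration, not taken from the Brown corpus)
toy_words = ["They", "bought", "four", "apples", "from", "market", "stalls"]
cfd = length_bigram_cfd(toy_words)

# How often a 6-letter word directly follows a 4-letter word
print(cfd[4][6])  # 3
```

With the real corpus, the same lookup is `cfd_news[4][6]`.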

Task 2

  • Compute bigrams of the list lc_news_words, and store it in the variable lc_news_bigrams.

  • From lc_news_bigrams, filter bigrams where both words contain only alphabetic characters. Store the result in lc_news_alpha_bigrams.

  • Extract the list of words associated with the corpus stopwords. Store the result in stop_words.

  • Convert each word of the list stop_words into lower case, and store the result in lc_stop_words.

  • Filter only the bigrams from lc_news_alpha_bigrams where the words are not part of lc_stop_words. Store the result in lc_news_alpha_nonstop_bigrams.

  • Print the total number of filtered bigrams.
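The Task 2 steps can be sketched on a toy word list as follows. The sample words and the two-word stop list are assumptions chosen for illustration, and `alpha_nonstop_bigrams` is a hypothetical helper name; with NLTK the inputs would be `brown.words(categories='news')` and the stopwords corpus.

```python
def alpha_nonstop_bigrams(words, stop_words):
    """Return bigrams whose two words are alphabetic and not stop words."""
    lc_words = [w.lower() for w in words]
    # A set gives fast membership tests; it does not change which bigrams pass
    lc_stop_words = {w.lower() for w in stop_words}
    bigrams = zip(lc_words, lc_words[1:])
    return [
        (w1, w2)
        for w1, w2 in bigrams
        if w1.isalpha() and w2.isalpha()
        and w1 not in lc_stop_words and w2 not in lc_stop_words
    ]


# Toy inputs (assumptions, not real corpus data)
toy_words = ["The", "jury", "said", "Friday", ",", "an",
             "investigation", "took", "place", "."]
toy_stop_words = ["the", "an"]

kept = alpha_nonstop_bigrams(toy_words, toy_stop_words)
print(len(kept))  # 4
```

Bigrams that touch punctuation fail the `isalpha()` check, and bigrams that contain "the" or "an" fail the stop word check, so four pairs remain.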

Task 1 passes, but Task 2 fails. Please help me find where I went wrong.

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

news_words = brown.words(categories = 'news')
lc_news_words = [word.lower() for word in news_words]
len_news_words = [len(word) for word in lc_news_words]
news_len_bigrams = nltk.bigrams(len_news_words)
cfd_news = nltk.ConditionalFreqDist(news_len_bigrams )
print(cfd_news[4][6])

lc_news_bigrams = nltk.bigrams(lc_news_words)
lc_news_alpha_bigrams = [ (w1, w2) for w1, w2 in lc_news_bigrams if w1.isalpha() and w2.isalpha()]

stop_words = stopwords.words('english')
lc_stop_words = [word.lower() for word in stop_words]
lc_news_alpha_nonstop_bigrams = [(n1, n2) for n1, n2 in lc_news_alpha_bigrams if n1 not in lc_stop_words and n2 not in lc_stop_words]
print(len(lc_news_alpha_nonstop_bigrams))

Results

With 'english' passed, i.e. stop_words = stopwords.words('english'):

1084

17704

Without 'english', i.e. stop_words = stopwords.words():

1084

16876

There is 1 answer

Answered by Lohit

stop_words = set(stopwords.words())

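Everything in your code is fine; just build a set of unique stop words from the list. Also, removing the 'english' argument increases the number of stop words, because stopwords.words() then returns the stop words for every language in the corpus, and that full list is the set of stop words to be considered here.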
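A small sketch may help separate the two effects mentioned above. The word lists here are invented stand-ins (not the contents of NLTK's stopwords corpus), and `count_nonstop` is a hypothetical helper. Under those assumptions, it shows that switching from a list to a set leaves the count unchanged and only speeds up lookups, while a larger stop word collection removes more bigrams, which matches the drop from 17704 to 16876 reported in the question.

```python
def count_nonstop(bigrams, stop_words):
    """Count bigrams in which neither word is a stop word."""
    return sum(
        1 for w1, w2 in bigrams
        if w1 not in stop_words and w2 not in stop_words
    )


# Toy data (assumptions for illustration only)
toy_bigrams = [("jury", "said"), ("said", "de"), ("de", "la"),
               ("took", "place"), ("place", "van")]
english_only = ["the", "an", "the"]            # stand-in for words('english')
all_languages = english_only + ["de", "la", "van"]  # stand-in for words()

# List vs. set: same result, the set is just faster to search
count_list = count_nonstop(toy_bigrams, all_languages)
count_set = count_nonstop(toy_bigrams, set(all_languages))
print(count_list, count_set)  # 2 2

# A smaller stop word list filters out fewer bigrams
print(count_nonstop(toy_bigrams, english_only))  # 5
```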