Using Python for text analytics


I am trying to write a program that checks whether the words in a list are contained in a text file. I was thinking of using the intersection of two sets to accomplish this. Is there any other efficient way of achieving this?
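
For reference, the set-intersection idea could look something like the sketch below. The function name `words_found`, the sample text, and the tokenizing regex are illustrative assumptions, not part of the original question; in practice the text would come from reading the file.

```python
import re

def words_found(text, lookup_words):
    """Return the lookup words that occur in text (case-insensitive)."""
    # Assumption: a "word" is a run of letters or apostrophes.
    text_words = set(re.findall(r"[a-z']+", text.lower()))
    # Set intersection keeps only the words present in both sets.
    return {w.lower() for w in lookup_words} & text_words

sample = "We are no longer the Knights who say Ni."
print(sorted(words_found(sample, ["knights", "ni", "ekki"])))
# ['knights', 'ni']
```

Building the set takes one pass over the text, and each membership check afterwards is a hash lookup.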


There are 2 answers

Alexander McFarlane On

Quick & Easy Method

textblob is a library for text analysis.

This part of the docs describes how to obtain word and noun phrase frequencies, e.g.

>>> from textblob import TextBlob
>>> monty = TextBlob("We are no longer the Knights who say Ni. "
...                     "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.words.count('ekki', case_sensitive=False)
3

Higher Performance, Slower Method

If performance is a major concern, try cleaning the file into a list of words with a regex and then getting the frequencies with collections.Counter:

from collections import Counter
words = ['b','b','the','the','the','c']

print(Counter(words))
Counter({'the': 3, 'b': 2, 'c': 1})
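
The regex cleaning step is not shown above, so here is a rough sketch of how it might be combined with Counter. The helper name `word_counts` and the `\w+` pattern are assumptions for illustration; for a real file, the text would come from something like `open(path).read()`.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word occurrences in text."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped.
    return Counter(re.findall(r"\w+", text.lower()))

text = "We are now the Knights who say Ekki ekki ekki PTANG."
counts = word_counts(text)
print(counts["ekki"])       # 3
print(counts["shrubbery"])  # 0, a Counter returns 0 for missing keys
```

Once the Counter is built, each query is a single dictionary lookup, which is why it suits repeated queries.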

Higher Performance Method for a Single Query

Or, for even higher performance on a single, non-repeated query, count directly on the list (if you are going to query lots of words, store the counts in a Counter object instead):

words.count('the')
3

If you want even higher performance then use a high performance programming language!

Utsav T On

Hashing can also be used for a quick lookup.

  1. Read the file and parse the text.

  2. Store each unseen (new) word in a hash table.

  3. Finally, check whether each word in your lookup list is present in the hash table.

Dictionaries in Python are implemented using hash tables, so a dictionary could be a good choice. This could serve as starter code:

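lookup_list = ["word1", "word2", "word3"]
dictionary = {}  # maps each word to its number of occurrences

# The with block closes the file automatically, so no f.close() is needed.
with open("myfile.txt", "r") as f:
    file_data = f.read().split()

for word in file_data:
    # Testing membership on the dict itself is a hash lookup; .keys() is unnecessary.
    if word not in dictionary:
        dictionary[word] = 1
    else:
        dictionary[word] += 1

# Keep only the lookup words that occur in the file.
result = [i for i in lookup_list if i in dictionary]

print(result)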