Using Python for text analytics

I am trying to write a program that checks whether a list of words is contained in a text file. I was thinking of using the intersection of two sets to accomplish this. I am wondering if there is any other efficient way of achieving this?
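For reference, a minimal sketch of the set-intersection idea (the file name and word list are placeholders):

words_to_find = {"word1", "word2", "word3"}   # placeholder lookup words

with open("myfile.txt", "r") as f:            # placeholder file name
    file_words = set(f.read().split())

# The intersection keeps exactly the lookup words that appear in the file
print(words_to_find & file_words)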
Asked by Ebelechukwu Nwafor
There are 2 answers
Hashing can also be used for a quick lookup:

Read the file and parse the text.
Store each unseen (new) word in a hash table.
Finally, check each word in your lookup list for presence in the hash table.

Dictionaries in Python are implemented using hash tables, so they are a good fit here. This could be a starter code:
lookup_list = ["word1", "word2", "word3"]

dictionary = {}
with open("myfile.txt", "r") as f:    # the with block closes the file for us
    file_data = f.read().split()

for word in file_data:
    if word not in dictionary:        # membership test on the dict itself, no .keys() needed
        dictionary[word] = 1
    else:
        dictionary[word] += 1

result = [w for w in lookup_list if w in dictionary]
print(result)
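(Note that the code above only uses the counts as a presence test; if you never need the frequencies, building a plain set from file_data would serve equally well.)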
Quick & Easy Method
textblob is a library for text analysis. This part of the docs describes how you can obtain word and noun phrase frequencies, e.g. the sketch below.
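A minimal sketch of that approach (the file name and lookup words are placeholders; the keys of word_counts are lowercased):

from textblob import TextBlob

with open("myfile.txt", "r") as f:    # placeholder file name
    blob = TextBlob(f.read())

# blob.word_counts maps each lowercased word in the text to its frequency
lookup_list = ["word1", "word2", "word3"]
print([w for w in lookup_list if blob.word_counts[w.lower()] > 0])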
Higher Performance (But Slower to Write) Method
If you are looking for high performance and this is a big issue, perhaps try cleaning the file into a list of words with a regex and then getting the frequencies with collections.Counter, as sketched below.
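A minimal sketch (the file name and the \w+ tokenization are assumptions):

import re
from collections import Counter

with open("myfile.txt", "r") as f:    # placeholder file name
    # \w+ treats every run of word characters as a word; lower() makes it case-insensitive
    words = re.findall(r"\w+", f.read().lower())

counts = Counter(words)
lookup_list = ["word1", "word2", "word3"]
print({w: counts[w] for w in lookup_list})   # Counter returns 0 for absent words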
Higher Performance Method for a Single Query
Or, for even higher performance on a single, non-repeated query, search the raw text directly instead of building any structure (if you are going to query lots of words, store the counts as a Counter object), as sketched below.
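A sketch of a one-off check, assuming a case-insensitive, whole-word match is acceptable:

import re

with open("myfile.txt", "r") as f:    # placeholder file name
    text = f.read()

# A word-boundary search avoids building any intermediate data structure
print(bool(re.search(r"\bword1\b", text, re.IGNORECASE)))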
If you want even higher performance than that, use a high-performance programming language!