I have a list of, say, 10,000 strings (A). I also have a vector of words (V).
What I want to do is modify each string of A so that it keeps only those words which are present in V and drops the rest.
For example, say the first element of A is "one two three check test" and V is the vector ["one", "test", "nine"]. The modified version of the first element of A should then be "one test". The whole process needs to be repeated for every string of A; V stays the same for each comparison.
I am doing something like the following (it may have some bugs, but I just want to give an idea of how I am approaching the problem):
import nltk

for i in range(len(A)):
    a = []
    text = nltk.word_tokenize(A[i])
    for j in range(len(text)):  # separate index so the outer i is not shadowed
        if text[j] in V:        # linear scan of V for every word
            a.append(text[j])
    A[i] = " ".join(a)
The above approach is very slow and inefficient. How can I achieve this in a fast and efficient manner?
Here is my attempt:
The complexity is pretty bad, but to improve it you would have to give us more details: how long is V compared to the elements of A, do you want the words to stay in their original order after the selection, etc. It could be done faster with set intersection, but then the words would lose their original order. However, if you only convert V to a set and check membership while iterating over each string's words, you get fast lookups and still preserve the order.
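Here is a minimal sketch of that idea, using the small example data from the question. Building a set from V once makes each membership test O(1) on average, and because we iterate over the words of each string in order, the original word order is preserved (I use `str.split` here instead of `nltk.word_tokenize` to keep the sketch self-contained):

```python
# Example data from the question (A would really hold ~10,000 strings)
A = ["one two three check test", "nine ten eleven"]
V = ["one", "test", "nine"]

allowed = set(V)  # build once; average O(1) lookups instead of scanning V

# Keep only allowed words, preserving their original order in each string
A = [" ".join(word for word in s.split() if word in allowed) for s in A]

print(A)  # -> ['one test', 'nine']
```

This brings the total cost down to roughly O(total number of words in A) plus the one-time O(len(V)) set construction, instead of O(words × len(V)) for the nested-loop version.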