Keep words present in a given vector and remove others

I have a list of, say, 10,000 strings (A). I also have a vector of words (V).

What I want to do is modify each string of A so that it keeps only the words that are present in V and drops all the others.

For example, let's say the first element of A is "one two three check test" and V is the vector ["one", "test", "nine"]. The modified version of the first element of A should then be "one test". The whole process needs to be repeated for every string of A. V stays the same for every comparison.

I am doing something like the following (this could have some bugs, but I just want to give an idea of how I am approaching the problem).

import nltk

modified = []  # filtered strings, in the same order as A
for i in range(len(A)):
    a = []
    text = nltk.word_tokenize(A[i])
    for j in range(len(text)):  # j, so the outer index i is not overwritten
        if text[j] in V:
            a.append(text[j])
    modified.append(" ".join(a))

The approach above is very slow and inefficient. How can I achieve this in a fast and efficient manner?

There are 4 answers

0
Tony Babarino On

Here is my attempt:

>>> A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
>>> V = ["aba","bab"]
>>> list(map(lambda z: ' '.join(z), map(lambda x: filter(lambda y: y in V, x.split()), A)))
['aba bab', 'bab aba']

The complexity is pretty bad (every word is checked against the list V with a linear scan), but to improve it you would have to give us more details, such as how long V is compared to the elements of A and whether you want the words to stay in their original order after the selection. It could be done faster using set intersection, but the words wouldn't be in their original order.
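Speed and original order are not mutually exclusive, though. Here is a minimal sketch, assuming plain whitespace splitting is an acceptable tokenizer: V is converted to a set once, and each string is still scanned left to right, so order and repeated words survive. The names `keep_words` and `allowed` are hypothetical, chosen only for this illustration.

```python
def keep_words(strings, vocabulary):
    """Reduce each string to the words found in vocabulary, keeping order."""
    allowed = set(vocabulary)  # built once; membership tests are O(1) on average
    return [" ".join(w for w in s.split() if w in allowed) for s in strings]


A = ["one two three check test", "aba reer sdasd bab"]
V = ["one", "test", "nine", "aba", "bab"]
result = keep_words(A, V)
print(result)
```

Only the lookup structure changes compared to the list-based version; the iteration over the words of each string stays the same.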

1
Hugues Fontenelle On

Learn about sets:

A = ["one two three check test", "one nine six seven", "one two six seven"]  
A_modified = list()  
V = ["one", "test", "nine"] 
V_set = set(V)  
for line in A:  
    text = set(line.split()) # or use NLTK, here I just wanted something that runs on all installs  
    A_modified.append(list(text.intersection(V_set))) 

Note that line = list(text.intersection(V_set)) will NOT work: assigning to the loop variable only rebinds the name line and leaves the element stored in A unchanged.
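To make that concrete, here is a small sketch with made-up sample data that contrasts assigning to the loop variable with assigning through an index. The variable `unchanged` is only there to capture the state of A after the first loop.

```python
A = ["one two three", "nine ten"]
V_set = {"one", "nine"}

# Assigning to the loop variable only rebinds the name `line`;
# the strings stored in A are not touched.
for line in A:
    line = " ".join(w for w in line.split() if w in V_set)
unchanged = list(A)

# Assigning through the index replaces the element stored in the list.
for i, line in enumerate(A):
    A[i] = " ".join(w for w in line.split() if w in V_set)

print(unchanged)
print(A)
```

Building a new list, as the code above does with A_modified, avoids mutating A while iterating over it.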

Edit:

Scope creep:-) Your original question wasn't specific enough, but if you want to keep the order as well as the non-unique elements, I'd do it with a list comprehension:

for line in A:  
    A_modified += [[word for word in line.split() if word in V]]
0
poko On

For a single item, A[0]:

' '.join(set(A[0].split(' ')).intersection(V))
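One caveat with the intersection approach, shown as a short sketch on a made-up input string: a set keeps each word at most once and has no defined order, so repeated words and the original word order are lost. The names `as_set` and `in_order` are hypothetical.

```python
V = ["one", "test", "nine"]
V_set = set(V)
text = "test one check one"

# Set intersection: each matching word appears once, in no particular order.
as_set = set(text.split(" ")).intersection(V)

# Order-preserving filter: repeats and original positions are kept.
in_order = [w for w in text.split(" ") if w in V_set]

print(sorted(as_set))
print(in_order)
```

Which of the two is appropriate depends on whether the modified strings need to read like the originals.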
0
boardrider On

Sets seem to be the appropriate data structures here:

A = ["aba reer sdasd bab", "adb bab ergekj aba erger", "aba", "bab" ]
V = ["aba","bab"]

vset = set(V)
for i in A:
    print(tuple(set(i.split()).intersection(vset)))