keep words present in a given vector and remove others

Question

Asked by user3664020 on 17 February 2016 at 12:10 · 127 views

I have a list of, say, 10,000 strings (A). I also have a vector of words (V).

What I want to do is modify each string in A to keep only the words that are present in V and remove the others.

For example, let's say the first element of A is "one two three check test" and V is the vector ["one", "test", "nine"]. The modified version of the first element of A should then be "one test". The whole process needs to be repeated for every string in A. V stays the same for every comparison.

I am doing something like the following (this could have some bugs, but I just want to give an idea of how I am approaching the problem).

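import nltk

for i in range(len(A)):

    a = []

    text = nltk.word_tokenize(A[i])

    # use j here; reusing i would clobber the outer loop index
    for j in range(len(text)):
        if text[j] in V:
            a.append(text[j])

    a = " ".join(a)

    # A is a list of strings, so store the filtered string back by index
    A[i] = a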

The above approach is very slow and inefficient. How can I achieve this in a fast and efficient manner?

There are 4 answers

**Tony Babarino** · Answer 1 · 2016-02-17T12:26:52+00:00

Here is my attempt:

>>> A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
>>> V = ["aba","bab"]
>>> list(map(lambda z: ' '.join(z), map(lambda x: filter(lambda y: y in V, x.split()), A)))
['aba bab', 'bab aba']

The complexity is pretty bad, since every `y in V` test scans the whole list V. To improve it, you would have to give us more details, such as how long V is compared to the elements of A and whether you want the words to stay in their original order after the selection. It could be done faster using set intersection, but then the words wouldn't be in their original order.
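As a hedged sketch of a middle ground: converting V to a set only speeds up the membership test, while walking each string's words left to right still preserves order and duplicates. The names `keep_words` and `allowed` are illustrative and not from the question, and `str.split()` stands in for the NLTK tokenizer.

```python
def keep_words(strings, vocabulary):
    """Return each string reduced to the words that appear in vocabulary."""
    allowed = set(vocabulary)  # built once; lookups are O(1) on average
    return [
        " ".join(word for word in s.split() if word in allowed)
        for s in strings
    ]


A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
V = ["aba", "bab"]
result = keep_words(A, V)
print(result)
```

Because V is the same for every string, the set is built a single time and reused across all elements of A.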

**Hugues Fontenelle** · Answer 2 · 2016-02-17T12:33:30+00:00

Learn about:

- for loops. Python is not C, you usually don't need the "i" variable (http://www.tutorialspoint.com/python/python_loop_control.htm)
- sets. Useful for intersections (https://docs.python.org/2/library/sets.html)
- the fact that assigning to the loop variable does not modify the list in place (and the strings themselves are immutable), therefore you need to initialize a new list and append elements to it.

A = ["one two three check test", "one nine six seven", "one two six seven"]  
A_modified = list()  
V = ["one", "test", "nine"] 
V_set = set(V)  
for line in A:  
    text = set(line.split()) # or use NLTK, here I just wanted something that runs on all installs  
    A_modified.append(list(text.intersection(V_set)))

Note that line = list(text.intersection(V_set)) inside the loop will NOT update A: it only rebinds the name line, and the strings in A are immutable.
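To illustrate the note above with a small sketch: assigning to the loop variable leaves the list unchanged, whereas assigning through an index (here via `enumerate`) does replace the element. The variable names are illustrative, and a comprehension is used instead of a set intersection so that word order is kept.

```python
A = ["one two three check test", "one nine six seven"]
V_set = {"one", "test", "nine"}

# Rebinding the loop variable: A is left untouched.
for line in A:
    line = [word for word in line.split() if word in V_set]
unchanged = list(A)

# Assigning by index: the element of A is actually replaced.
for i, line in enumerate(A):
    A[i] = " ".join(word for word in line.split() if word in V_set)

print(unchanged)
print(A)
```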

Edit:

Scope creep :-) Your original question wasn't specific enough, but if you want to keep the order as well as the non-unique elements, I'd do it with a list comprehension:

for line in A:  
    A_modified += [[word for word in line.split() if word in V]]

**poko** · Answer 3 · 2016-02-17T12:17:27+00:00

For a single item, A[0]:

' '.join(set(A[0].split(' ')).intersection(V))
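Note that a set intersection drops both word order and repeated words. If the order matters but duplicates do not, one possible sketch is to sort the intersection by each word's first position in the original string. `keep_in_order` is a hypothetical helper name, not something from the answers.

```python
def keep_in_order(text, vocabulary):
    """Intersect the words of text with vocabulary, ordered by first occurrence."""
    words = text.split()
    common = set(words).intersection(vocabulary)
    # words.index returns the first position of each surviving word
    return " ".join(sorted(common, key=words.index))


A = ["one two three check test one"]
V = ["one", "test", "nine"]
first = keep_in_order(A[0], V)
print(first)
```

The `words.index` lookup is linear, so this is only a reasonable choice when each string is short. For long strings, a single ordered pass with a set membership test is likely the better fit.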

**boardrider** · Answer 4 · 2016-02-18T17:08:38+00:00

Sets seem to be the appropriate data structure here:

A = ["aba reer sdasd bab", "adb bab ergekj aba erger", "aba", "bab" ]
V = ["aba","bab"]

vset = set(V)
for i in A:
    print(tuple(set(i.split()).intersection(vset)))
