I have a list of strings that I want to filter, and a list of keywords that I am searching for in these strings. I only want to keep the items that contain one of the keywords, but I cannot seem to get the output I am looking for. I am not sure if regular expressions are the right way to handle this.
I want the output to be ['/item/page/cat-dog', '/item/page/animal-planet']

valid = ['/item/page/cat-dog', '/item/page/animal-planet', '/item/page/variable']
keywords = ['cat','planet']


for item in valid: 
    #a = re.findall()
    #

3 Answers

Paindemie On

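Python comes with the handy operators in and not in to test whether an object is or is not in a list.

For your problem, you can split the last component of each path into words and keep the items in which one of those words is a keyword:

import os

new_list = []
for item in valid:
    # Split the last path component, e.g. 'cat-dog', into ['cat', 'dog']
    words = os.path.basename(item).split('-')
    if any(word in keywords for word in words):
        new_list.append(item)

os.path.basename returns the last component of the path, without the leading directories (for example 'cat-dog'). new_list will then contain all the elements of valid whose last component contains one of the keywords as a hyphen-separated word.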
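To illustrate what in actually tests, here is a minimal sketch using the data from the question. On a list, in compares against whole elements, while on a string it looks for a substring. The variable names basename, in_list and as_substring are only for illustration and are not part of the original answer.

```python
import os

keywords = ['cat', 'planet']
basename = os.path.basename('/item/page/cat-dog')  # 'cat-dog'

# On a list, `in` compares the object against each whole element.
in_list = basename in keywords  # False: 'cat-dog' is not an element of the list

# On a string, `in` checks whether the left operand is a substring.
as_substring = 'cat' in basename  # True

print(basename, in_list, as_substring)
```

This is why testing the whole basename against the keywords list does not select '/item/page/cat-dog': the basename has to be compared piece by piece, or the keyword has to be searched for inside the string.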

Jose D. Rodriguez On

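As far as I can understand, and based on @dan-d's comment, what you need to get the output shown in the question is

[s for s in valid if any(q in s for q in keywords)]

Add not before any to drop the items that contain a keyword instead.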
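A self-contained sketch of this kind of comprehension, using the data from the question and showing both directions of the test. The names with_keyword and without_keyword are illustrative and do not come from the original answer.

```python
valid = ['/item/page/cat-dog', '/item/page/animal-planet', '/item/page/variable']
keywords = ['cat', 'planet']

# Items that contain at least one keyword as a substring.
with_keyword = [s for s in valid if any(q in s for q in keywords)]

# Items that contain none of the keywords.
without_keyword = [s for s in valid if not any(q in s for q in keywords)]

print(with_keyword)
print(without_keyword)
```

Only with_keyword matches the output requested in the question; without_keyword is its complement.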
snakecharmerb On

As suggested in the comments and other answers, the in operator may be used to check if a string is a substring of another string. For the example data in the question, using in is the simplest and fastest way to get the desired result.

If the requirement is to match '/item/page/cat-dog' but not '/item/page/catapult', that is, to match only the word 'cat' and not just the sequence c-a-t, then a regular expression may be used to do the matching.

The pattern to match a single word is r'\bfoo\b', where \b marks a word boundary. Write it as a raw string: in an ordinary Python string literal, '\b' is a backspace character.
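As a small sketch of the difference between a substring test and a word-boundary match, assuming a hypothetical second path '/item/page/catapult' that is not in the question's data:

```python
import re

paths = ['/item/page/cat-dog', '/item/page/catapult']

# Substring test: matches both, because 'catapult' contains c-a-t.
by_substring = [p for p in paths if 'cat' in p]

# Word-boundary test: 'cat' must not touch another word character.
# The pattern is a raw string so that \b reaches the regex engine intact.
by_word = [p for p in paths if re.search(r'\bcat\b', p)]

print(by_substring)
print(by_word)
```

The characters '/' and '-' are not word characters, so they count as boundaries around 'cat' in '/item/page/cat-dog'.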

The alternation operator '|' is used to match one pattern or another, for example 'foo|bar' matches 'foo' or 'bar'.

Construct a pattern that matches the words in keywords; call re.escape on each keyword in case they contain characters that the regex engine might interpret as metacharacters.

>>> import re
>>> pattern = r'|'.join(r'\b{}\b'.format(re.escape(keyword)) for keyword in keywords)
>>> pattern
'\\bcat\\b|\\bplanet\\b'
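To see why the re.escape call matters, here is a hedged sketch with a hypothetical keyword 'a.b' (not part of the question's data) that contains the metacharacter '.':

```python
import re

keyword = 'a.b'  # hypothetical keyword containing a regex metacharacter

unescaped = r'\b{}\b'.format(keyword)
escaped = r'\b{}\b'.format(re.escape(keyword))

# Unescaped, '.' matches any character, so 'axb' is accepted by mistake.
loose_hit = bool(re.search(unescaped, '/item/page/axb'))

# Escaped, '.' only matches a literal dot.
strict_miss = bool(re.search(escaped, '/item/page/axb'))
strict_hit = bool(re.search(escaped, '/item/page/a.b'))

print(loose_hit, strict_miss, strict_hit)
```

For plain alphabetic keywords such as 'cat' and 'planet', escaping changes nothing, which is why the pattern printed above contains no extra backslashes beyond those of \b.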

Compile the pattern into a regular expression object.

>>> rx = re.compile(pattern)

Find the matches. Using filter is elegant:

>>> matches = list(filter(rx.search, valid))
>>> matches
['/item/page/cat-dog', '/item/page/animal-planet']

But it's common to use a list comprehension:

>>> matches = [word for word in valid if rx.search(word)]
>>> matches
['/item/page/cat-dog', '/item/page/animal-planet']
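Putting the steps together, one possible way to package this is a small helper. The function name filter_by_keywords and the guard for an empty keyword list are assumptions added for illustration; they are not part of the original answer.

```python
import re

def filter_by_keywords(items, keywords):
    """Return the items that contain any of the keywords as a whole word."""
    # An empty keyword list would produce the empty pattern '',
    # which matches every string, so return early in that case.
    if not keywords:
        return []
    pattern = '|'.join(r'\b{}\b'.format(re.escape(k)) for k in keywords)
    rx = re.compile(pattern)
    return [item for item in items if rx.search(item)]

valid = ['/item/page/cat-dog', '/item/page/animal-planet', '/item/page/variable']
keywords = ['cat', 'planet']
print(filter_by_keywords(valid, keywords))
```

Compiling the pattern once per call keeps the loop cheap when the list of items is long.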