A more efficient pythonic way to use if condition

168 views Asked by At

I have this piece of code which checks conditions:

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row)
    for word in NEGATIVE_WORDS:
        if word in add_data.lower():
            return False
    for word in POSITIVE_WORDS:
        if word in add_data.lower():
            return True
    return False

This is quite hard to follow (in my opinion), so I was wondering if anyone can suggest something more pythonic with shorter lines? Could I for example merge the two for loops? If I merge two for loops, would it consumes more time?

3

There are 3 answers

8
Ma0 On BEST ANSWER

That is more compact and short-circuits due to the anys much like your explicit loops did.

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row)
    if any(word in add_data.lower() for word in NEGATIVE_WORDS):  # negative check takes precedence.
        return False
    if any(word in add_data.lower() for word in POSITIVE_WORDS):
        return True
    return False

A couple of things on that:

  • why do you call .lower() when searching for NEGATIVE_WORDS and not for POSITIVE_WORDS?
  • if add_data contain both NEGATIVE_WORDS and POSITIVE_WORDS, the order of the last two ifs will affect the outcome. This is not good practice.
7
MSeifert On

This is quite hard to follow (in my opinion), so I was wondering if anyone can suggest something more pythonic with shorter lines?

Generally pythonic doesn't mean shorter lines. Pythonic code could should be easy to read and follow (at least with a bit of background). So if you find that hard to read you could factor it into a different function:

# I'm not sure if the function name is a good fit, it's just a suggestion.
def contains_at_least_one(data, words):  
    for word in words:
        if word in data:
            return True
    return False

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    if contains_at_least_one(add_data, NEGATIVE_WORDS):
        return False
    if contains_at_least_one(add_data, POSITIVE_WORDS):
        return True
    return False

Could I for example merge the two for loops?

Not really. Because the NEGATIVE_WORDS loop should take precedence (at least in your code) over the POSITIVE_WORDS loop. Except you meant factoring it in a function. Then see first.

If I merge two for loops, would it consumes more time?

I'm not sure what you meant by "merging" the loops but if you want it shorter you could use any in the approach above. It's equivalent to the for-loop and shorter - but, according to my and StefanPochmans benchmarks, slower:

def contains_at_least_one(data, words):
    return any(word in data for word in words)

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    if contains_at_least_one(add_data, NEGATIVE_WORDS):
        return False
    if contains_at_least_one(add_data, POSITIVE_WORDS):
        return True
    return False

You could even reduce the number of lines by using and for return. I would not recommend it because such constructs don't improve the readability, but that's your decision and it's one way to "shorten" code:

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    return (not contains_at_least_one(add_data, NEGATIVE_WORDS) and
            contains_at_least_one(add_data, POSITIVE_WORDS))

A bit far fetched but maybe you could even use sets to speed this up. This would require that you only look for whole word matches (not partial matches, not multi-word matches):

def contains_at_least_one(data, words):
    return data.intersection(words)

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = set(get_additional_data(data_row).lower().split())  # set and split!
    return not contains_at_least_one(add_data, NEGATIVE_WORDS) and contains_at_least_one(add_data, POSITIVE_WORDS)

See also the regex suggestions in tobias_k's answer if you don't want punctuation to break your matching. However the set approach is only meant as "small suggestion" - I doubt it could be applied in your case. But you need to judge that.

2
tobias_k On

Besides using any, you could also combine the different conditions into a single return statement, though whether that's clearer might be a matter of opinion.

def is_important(data_row):
    add_data = get_additional_data(data_row)
    return (data_row.get('important', None)
        or (not any(word in add_data.lower() for word in NEGATIVE_WORDS)
            and any(word in add_data         for word in POSITIVE_WORDS)))

Though if get_additional_data is expensive, you might keep that first if separate.

Also, you can probably speed up the check by converting add_data to a set of (lowercase) words first, but this changes the logic slightly, as this will e.g. not match word-fragments.

def is_important(data_row):
    add_data = set((word.lower() for word in get_additional_data(data_row).split()))
    return (data_row.get('important', None)
        or (not any(word in add_data for word in NEGATIVE_WORDS)
            and any(word in add_data for word in POSITIVE_WORDS)))

Or, instead of .split(), use e.g. re.findall(r"\w+") to find words without punctuation.

Depending on the size of the positive- and negative-lists is might also pay off to invert the check, e.g. any(word in POSITIVE_WORDS for word in add_data.split()), particularly if those are already set structures with fast lookup.