How to optimize comparing combinations in a Python list


I have a list of 16,000 lists. Each sublist contains a tuple and a score, like this:

mylist = [
    [('only','one','time'), 10.54],
    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223],
    [('red','hot','chili'), 1.98],
]

My goal is to iterate through the combinations in mylist and, whenever one tuple is a subset or superset of another, remove the element with the lower score. So in this example, I want to remove

[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223]

because ('one','time') is a subset of ('only','one','time') and, between the two, it has the lower score (10.54 > 3.21).
('red','hot','chili','peppers') is a superset of ('red','hot','chili') and, between the two, it has the lower score (0.223 < 1.98).

My initial solution was brute force: generate every combination of the list taken 2 at a time, compare the tuples for subsets using the all() function, then drop the item with the min() score.
This performs poorly due to the number of combinations to search:

    from itertools import combinations
    from operator import itemgetter

    def process_list_brute(mylist):
        removelist = []
        for x, y in combinations(mylist, 2):
            # True when either tuple's words are all contained in the other's
            if (all(word in x[0] for word in y[0]) or
                    all(word in y[0] for word in x[0])):
                smallest = min([x, y], key=itemgetter(1))
                if smallest not in removelist:  # sublists are unhashable, so no set()
                    removelist.append(smallest)
        return [x for x in mylist if x not in removelist]

returns:

outlist = [
    [('only','one','time'), 10.54],
    [('red','hot','chili'), 1.98]
]

So for a list of ~16,000 sublists, this would be roughly:

combinations = n! / (r! * (n-r)!)
combinations = 16,000! / (2! * 15,998!)
combinations = 16,000 * 15,999 / 2
combinations = 127,992,000
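
That count matches a quick stdlib check (a minimal sketch; math.comb needs Python 3.8+):

    import math

    # Unordered pairs from a 16,000-item list: C(16000, 2)
    print(math.comb(16_000, 2))  # 127992000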

Is there a smarter way to do this, reducing the 127 million items I need to check?

There are 2 answers

Best answer, by Stefan Pochmann:

This might be a thousand times faster than yours. First I convert the word tuples to sets for simpler and faster subset checks, like @Alexander. Then I sort by set size, so I don't have to check for superset. (Because if |A| ≤ |B|, then the only way A is a superset of B is if it is B, in which case it is also a subset of B).
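
The size sort means a single subset test covers both directions. A toy illustration (not part of the answer's code):

a = {'red', 'hot', 'chili'}
b = {'red', 'hot', 'chili', 'peppers'}
# len(a) <= len(b), so a can only be a superset of b if a == b,
# in which case the subset test a <= b is also true.
print(a <= b)  # True
print(a >= b)  # False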

And then comes my main trick. Let's say we have the word set {'red','hot','chili'}, and we want to find the word sets of which it is a subset. Do we need to check all other (larger or equal-size) sets? No. It suffices to check only those sets that contain the word 'red'. Or only those with 'hot'. Or only those with 'chili'. Let's take the rarest word, i.e., the one in the fewest sets (in this case I'd guess 'chili').
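
As a standalone sketch of that lookup trick (the toy data here is made up; the real implementation follows below):

from collections import defaultdict

word_sets = [{'red', 'hot', 'chili'},
             {'red', 'hot', 'chili', 'peppers'},
             {'one', 'time'}]

# Inverted index: word -> list of word sets containing that word
index = defaultdict(list)
for s in word_sets:
    for word in s:
        index[word].append(s)

query = {'red', 'hot', 'chili'}
# Any superset of query must contain query's rarest word,
# so only those candidates need a full subset check.
rarest = min(query, key=lambda word: len(index[word]))
print([s for s in index[rarest] if query <= s])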

I decided to call your lists "songs"; it makes them nicer to talk about.

from collections import defaultdict

def process_list_stefan(mylist):

    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(list)
    for song in songs:
        i, words1, score1 = song
        # Pick the song's rarest word
        word = min(words1, key=lambda word: len(songs_with_word[word]))
        # Go through songs containing that word
        for j, words2, score2 in songs_with_word[word]:
            if words1 <= words2:
                # Lower score loses. In case of tie, lower index loses.
                remove.add(min((score1, i), (score2, j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].append(song)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
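
For example, called on the question's mylist, it returns the expected result:

print(process_list_stefan(mylist))
# [[('only', 'one', 'time'), 10.54], [('red', 'hot', 'chili'), 1.98]]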

Update: Actually, instead of just using the song's rarest word and going through all its "supersets" (sets containing that word), consider all words in the current song and use the intersection of their "supersets". In my testing with made-up data, it's even faster, by about a factor of 1.6:

from collections import defaultdict

def process_list_stefan(mylist):

    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Helper: Intersection of sets
    def intersect(sets):
        s = next(sets).copy()
        for t in sets:
            s &= t
        return s

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(set)
    for song in songs:
        i, words1, score1 = song
        for j in intersect(songs_with_word[word] for word in words1):
            # Lower score loses. In case of tie, lower index loses.
            remove.add(min((score1, i), (mylist[j][1], j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].add(i)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
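
To reproduce the timings, a harness along these lines can be used (make_data and its parameters are made up for illustration; real speedups depend on the word distribution):

import random
import string
import timeit

def make_data(n_songs=2000, vocab=500, max_len=6):
    # Synthetic test data: random word tuples with random scores
    words = [''.join(random.choices(string.ascii_lowercase, k=5)) for _ in range(vocab)]
    return [[tuple(random.sample(words, random.randint(2, max_len))), random.random()]
            for _ in range(n_songs)]

data = make_data()
print(timeit.timeit(lambda: process_list_stefan(data), number=3))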

Answer by Alexander:

First, create a new list that retains the original scores but converts the tuples of words into sets for faster comparisons and set membership testing.

Enumerate through each set of words and scores in this new list and compare against all remaining sets and scores. Using sets, we can detect subsets via s1.issubset(s2) and supersets via s1.issuperset(s2).
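
As a quick illustration of those set methods:

s1 = {'one', 'time'}
s2 = {'only', 'one', 'time'}
print(s1.issubset(s2))    # True  (equivalent to s1 <= s2)
print(s1.issuperset(s2))  # False (equivalent to s1 >= s2)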

Once a subset/superset relationship is detected, we compare scores. If the current record has the higher score, we mark the other record for removal and continue comparing against the remaining records. Otherwise, we add the current record's index to the set of indices to be removed and continue the remaining comparisons.

Once we have processed all of the records, we use a conditional list comprehension to create a new list of all records to keep.

In terms of subset comparisons, the worst case is n * (n - 1) / 2 comparisons, which is O(n^2). Of course, each individual comparison has its own cost, which depends on the number of unique words in each sublist. This solution thus makes the same number of comparisons as the OP's for x,y in combinations(mylist,2) method, but the subset/superset checks are done with sets rather than lists, so it should still be significantly faster.

def process_list(my_list):
    # Convert tuples to sets.
    my_sets = [(set(tuples), score) for tuples, score in my_list]

    idx_to_remove = set()
    for i, (subset1, score1) in enumerate(my_sets):
        for j, (subset2, score2) in enumerate(my_sets[(i + 1):], start=i + 1):
            if subset1.issubset(subset2) or subset1.issuperset(subset2):
                # Subset/Superset detected.
                idx_to_remove.add(i if score1 < score2 else j)

    # Remove filtered items from list and return filtered list.
    return [tup for n, tup in enumerate(my_list) if n not in idx_to_remove]

# TEST CASES

# Case 1.
mylist = [
    [('only','one','time'), 10.54],
    [('one','time'), 3.21],
    [('red','hot','chili','peppers'), 0.223],
    [('red','hot','chili'), 1.98],
] 
>>> process_list(mylist)
[[('only', 'one', 'time'), 10.54], [('red', 'hot', 'chili'), 1.98]]

# Case 2.
# ('a', 'b', 'd') is superset of ('a', 'b') and has a lower score, so remove former.
# ('a', 'b') is a subset of ('a', 'b', 'c') and has a lower score, so remove former.
mylist = [[('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1], [('a', 'b'), 2]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]

# Case 3.  Same items as Case 2, but different order.  Same logic as Case 2.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 3], [('a', 'b', 'd'), 1]]
>>> process_list(mylist)
[[('a', 'b', 'c'), 3]]

# Case 4.
# ('a', 'b', 'c') is a superset of ('a', 'b') and has a lower score, so remove former.
# ('d','c') is a subset of ('d','c','w') and has a lower score, so remove former.
mylist = [[('a', 'b'), 2], [('a', 'b', 'c'), 1], [('d','c','w'), 4], [('d','c'), 2]]
>>> process_list(mylist)
[[('a', 'b'), 2], [('d', 'c', 'w'), 4]]