I have a list of 16,000 lists. Each sublist contains a tuple and a score, like this:
mylist = [
[('only','one','time'),10.54],
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223],
[('red','hot','chili'),1.98]
]
My goal is to iterate through combinations in mylist and remove any element when a superset or subset is detected. The element to be removed is based on the lowest score between the two. So in this example, I want to remove
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223]
because ('one','time') is a subset of ('only','one','time') and, between the two, has the lower score (10.54 > 3.21), and ('red','hot','chili','peppers') is a superset of ('red','hot','chili') and has the lower score (0.223 < 1.98).
My initial solution was brute force: generate every combination of two elements from the list, compare the tuples for a subset relation using the all() function, then drop the item with the min() score.
This performs poorly due to the number of combinations to check:
from itertools import combinations
from operator import itemgetter

def prune_bruteforce(mylist):
    removelist = []
    for x, y in combinations(mylist, 2):
        if (all(word in x[0] for word in y[0])
                or all(word in y[0] for word in x[0])):
            smallest = min([x, y], key=itemgetter(1))
            removelist.append(smallest)
    # the sublists are unhashable, so deduplicate by identity instead of set()
    removeids = {id(item) for item in removelist}
    return [x for x in mylist if id(x) not in removeids]
returns:
outlist = [
[('only','one','time'),10.54],
[('red','hot','chili'),1.98]
]
So for a list of ~16,000 sublists, this would be roughly:
combinations = n! / (r! * (n - r)!)
combinations = 16,000! / (2! * 15,998!)
combinations = 16,000 * 15,999 / 2
combinations = 127,992,000
Is there a smarter way to do this, reducing the 127 million items I need to check?
This might be a thousand times faster than yours. First I convert the word tuples to sets for simpler and faster subset checks, like @Alexander. Then I sort by set size, so I don't have to check for supersets. (Because if |A| ≤ |B|, then the only way A can be a superset of B is if A equals B, in which case it is also a subset of B.)
And then comes my main trick. Let's say we have the word set {'red','hot','chili'}, and we want to find the word sets of which it is a subset. Do we need to check all other (larger or equal-size) sets? No. It suffices to check only those sets that contain the word 'red'. Or only those with 'hot'. Or only those with 'chili'. Let's take the rarest word, i.e., the one in the fewest sets (in this case I'd guess 'chili').
I decided to call your lists "songs", makes it nice to talk about them.
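The rarest-word idea can be sketched like this (a minimal sketch, not the answer's original code: the function name `prune` is my choice, and each subset/superset pair is resolved the same way as the brute force, by dropping the lower-scored song):

```python
from collections import defaultdict

def prune(mylist):
    # Sort by phrase length: any strict superset of a song comes later,
    # so each subset/superset pair is met exactly once (i before j).
    songs = sorted(mylist, key=lambda song: len(song[0]))
    wordsets = [set(song[0]) for song in songs]

    # Inverted index: word -> indices of the songs containing it.
    index = defaultdict(list)
    for i, ws in enumerate(wordsets):
        for word in ws:
            index[word].append(i)

    removed = set()
    for i, ws in enumerate(wordsets):
        # Only songs containing this song's rarest word can be supersets.
        rarest = min(ws, key=lambda word: len(index[word]))
        for j in index[rarest]:
            if j > i and ws <= wordsets[j]:
                # Same rule as the brute force: drop the lower-scored song.
                loser = i if songs[i][1] < songs[j][1] else j
                removed.add(loser)

    return [song for k, song in enumerate(songs) if k not in removed]
```

Note that the output comes back sorted by phrase length rather than in the original order; with real data the rarest word's posting list is usually tiny, which is where the speedup comes from.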
Update: Actually, instead of just using the song's rarest word and going through all its "supersets" (the sets containing that word), consider all words in the current song and use the intersection of their "supersets". In my testing with made-up data, that's faster by about a factor of 1.6.
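One way the intersection variant could look (again a hedged sketch rather than the answer's original code; `prune_intersect` is my name, and postings are stored as sets so they can be intersected directly, which removes the per-candidate subset test entirely):

```python
from collections import defaultdict
from functools import reduce

def prune_intersect(mylist):
    songs = sorted(mylist, key=lambda song: len(song[0]))
    wordsets = [set(song[0]) for song in songs]

    # Inverted index: word -> set of indices of songs containing it.
    index = defaultdict(set)
    for i, ws in enumerate(wordsets):
        for word in ws:
            index[word].add(i)

    removed = set()
    for i, ws in enumerate(wordsets):
        # Intersecting every word's posting set yields exactly the songs
        # whose phrase contains all of ws, i.e. its supersets (and i itself).
        supersets = reduce(set.intersection, (index[word] for word in ws))
        for j in supersets:
            if j > i:  # j's phrase is a superset of i's; drop the loser
                loser = i if songs[i][1] < songs[j][1] else j
                removed.add(loser)

    return [song for k, song in enumerate(songs) if k not in removed]
```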