I have a list of tuples, about 100k entries. Each tuple consists of an id and a string, my goal is to list the ids of the tuples, whose strings contain a substring from a given list of substrings. My current solution is through set comprehension, ids can repeat.
tuples = [(id1, 'cheese trees'), (id2, 'freezy breeze'),...]
vals = ['cheese', 'flees']
ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}
output: {id1}
Is there an algorithm that would allow doing that quicker? I'm interested in both exact substring matches and also possibly in the approximate ones. The main thing I'm after here is an algorithm that would offer speed advantage over the comprehension.
DISCLAIMER I'm the author of trrex
For the case of the exact matching, one approach for solving this, is to use a Trie, as mentioned in the comments. trrex is a library that makes a Trie-Regex (a Trie in regex format) that can be used in conjunction with the regular expression engine of Python:
Output
The timings for the above functions are:
The data is a list of around 2K questions from jeopardy, and 500 randomly chosen words. You can find here the resources for reproducing the experiments.
UPDATE
If you add the grouping strategy mentioned in the comments the time improvements increases, below is the function:
and the timings:
The approach of grouping + trrex gives you an approximated 10 times improvement on performance. But take this last result with a grain of salt because it's very dependent on the dataset.