I've been trying to find suitable algorithms and data structures to perform a quick search over five million paragraphs based on given keywords, using Python. I have looked at the standard data structures and search algorithms, and their time complexities, through a related post, including sets, lists, trees, hash tables, etc.
The problem is as follows: given the data, entries with associated identities, i.e.
ID:1, 1word1 1word2 1word3... 1wordj1
ID:2, 2word1 2word2 2word3... 2wordj2
ID:3, 3word1 3word2 3word3... 3wordj3
...
ID:N, Nword1 Nword2 Nword3... NwordjN
how to perform a search based on keywords in the syntax of
contain: keyword1 OR keyword2 OR ... OR keywordi
AND
contain: NOT keyword(i+1) AND NOT keyword(i+2) ... AND NOT keywordj
(equivalently: does not contain keyword(i+1) OR ... OR keywordj)
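To make the intended semantics concrete, here is a minimal sketch using plain Python sets over a few toy entries (whitespace tokenization and all data here are assumptions for illustration only):

```python
entries = {
    1: "apple banana cherry",
    2: "banana durian",
    3: "cherry elderberry",
}

def matches(text, any_of, none_of):
    """True if text contains at least one word from any_of
    and no word from none_of (whitespace tokenization assumed)."""
    words = set(text.split())
    return bool(words & any_of) and not (words & none_of)

hits = [i for i, text in entries.items()
        if matches(text, any_of={"banana", "cherry"}, none_of={"durian"})]
print(hits)  # [1, 3]
```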
A hash table keyed on a fixed list of M keywords might be useful, encoding each entry as a bit string, i.e.
1 (contains keyword1) 0 (does not contain keyword2) ... 1 (contains keywordM)
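As a rough sketch of that bit-string idea, assuming a fixed preselected keyword list and whitespace tokenization (the keyword names are made up), each entry can be packed into an M-bit integer so that a query reduces to two bitwise tests per entry:

```python
keywords = ["keyword1", "keyword2", "keyword3"]  # the fixed list of M keywords
bit = {w: 1 << n for n, w in enumerate(keywords)}

def mask_of(text):
    """Pack 'contains keyword n' into bit n of an integer."""
    words = set(text.split())
    return sum(bit[w] for w in keywords if w in words)

masks = {1: mask_of("keyword1 keyword3"), 2: mask_of("keyword2")}

want = bit["keyword1"] | bit["keyword3"]   # contain: keyword1 OR keyword3
avoid = bit["keyword2"]                    # does not contain: keyword2
hits = [i for i, m in masks.items() if (m & want) and not (m & avoid)]
print(hits)  # [1]
```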
or, equivalently, a function to compute
sum_{n=1}^{M} e^(2*pi*i*j / 2^n),
a binary tree algorithm in disguise, where j being odd or even at level n encodes whether keyword n is contained or not; by searching the range of the "hash angle" for the specific keyword levels, the matching entries could be determined, using an arbitrary-precision library.
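Purely to pin down that formula as written (not an endorsement of the scheme), here is a small sketch, assuming j is the integer whose n-th binary digit records whether keyword n is contained; note cmath works in double precision, whereas real use would need an arbitrary-precision library (e.g. mpmath):

```python
import cmath
from math import pi

def hash_angle(bits):
    """bits[n-1] == 1 means keyword n is contained (level n)."""
    j = sum(b << (n - 1) for n, b in enumerate(bits, start=1))
    return sum(cmath.exp(complex(0.0, 2 * pi * j / 2**n))
               for n in range(1, len(bits) + 1))

print(hash_angle([1, 0, 1]))  # an entry containing keyword1 and keyword3
```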
Those algorithms have their own advantages and obvious defects. I even attempted a double-binary-tree scheme, where both trees were constructed as lists representing set unions or intersections; for a simple two-keyword search it could be very fast, because it reduces to just two or three set operations, but for multiple keywords the hand expansion of the logical expression became nasty. These approaches also require preprocessing a fixed number of preselected keywords, which is less flexible. (The set-operation reduction is sketched below.)
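For reference, that set-operation reduction can be sketched with a per-keyword set of entry IDs built once in advance; all data here is made up for illustration:

```python
from collections import defaultdict

entries = {1: "apple banana", 2: "banana cherry", 3: "apple cherry"}

postings = defaultdict(set)
for entry_id, text in entries.items():
    for word in set(text.split()):
        postings[word].add(entry_id)

# contain: apple OR banana, AND NOT cherry -- three set operations
hits = (postings["apple"] | postings["banana"]) - postings["cherry"]
print(sorted(hits))  # [1]
```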
Is there a standard algorithm for keyword search over a large number of entries (five million)? My aim is to reduce each multi-keyword search to a small, bounded time on the order of 5-10 seconds, so that the process can be extrapolated to larger data.