What indexer do I use to find the list in the collection that is most similar to my list?


Let's say I have my list of ingredients: {'potato','rice','carrot','corn'}

and I want to return lists from a database that are most similar to mine:

{'beans','potato','oranges','lettuce'}, {'carrot','rice','corn','apple'}, {'onion','garlic','radish','eggs'}

My query would return this first: {'carrot','rice','corn','apple'}
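To pin down what "most similar" means here, a minimal Python sketch, assuming similarity is simply the number of shared items (the question itself does not fix a metric). It loops over every stored list, so it only illustrates the desired ranking, not an approach that would scale to a million lists.

```python
# Assumption: "most similar" means "shares the most items with my list".
my_list = {"potato", "rice", "carrot", "corn"}

stored_lists = [
    {"beans", "potato", "oranges", "lettuce"},
    {"carrot", "rice", "corn", "apple"},
    {"onion", "garlic", "radish", "eggs"},
]


def overlap(candidate, query):
    """Number of items the candidate shares with the query list."""
    return len(candidate & query)


# Sort candidates so the list with the largest overlap comes first.
ranked = sorted(stored_lists, key=lambda c: overlap(c, my_list), reverse=True)
best = ranked[0]
```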

I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.

In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.

What technology should I use to accomplish what I want to do?

Should I look away from search indexers and more toward database-style tools such as MongoDB, MapReduce, or Hadoop? All I know are the names of these technologies, and I just need someone to point me in the right direction on which technology path I should be exploring.

With so much data, I can't really loop through it; I need to query everything at once.

There is 1 answer

BlueM (BEST ANSWER)

I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and store each list item as a separate value. Then, when querying, you specify each item in your search list as a search term for that field, and Solr will, by default, return the closest matches first. If you need exact control over what counts as a match (e.g. at least 40% of the terms from the search list have to be present in a matching list), you can use the mm ("minimum should match") parameter of the EDisMax query parser; cf. the Solr Wiki.
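As a rough illustration of such a query, here is a minimal Python sketch that only builds the request URL. The core URL and the field name `ingredients` are hypothetical; `defType`, `q`, `qf`, `mm`, `rows` and `fl` are standard Solr request parameters, but the values shown are assumptions to adapt to your setup.

```python
from urllib.parse import urlencode

# Hypothetical names: the core URL and field name are assumptions for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/lists/select"
FIELD = "ingredients"


def build_query_url(items, min_match="40%", rows=10):
    """Build an EDisMax query URL that searches FIELD for every item in the list."""
    # Quote each item so that values containing spaces stay a single term.
    terms = " ".join('"{}"'.format(item.replace('"', '\\"')) for item in sorted(items))
    params = {
        "defType": "edismax",  # use the EDisMax query parser
        "q": terms,            # one search term per list item
        "qf": FIELD,           # field holding the multi-valued list
        "mm": min_match,       # minimum share of terms that must match
        "rows": rows,
        "fl": "id,score",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_query_url({"potato", "rice", "carrot", "corn"})
```

With a 200-item search list the URL gets long, so sending the same parameters in a POST body may be the safer choice.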

Having said that, I must add that I've never searched with 200 query terms (do I understand correctly that the search list will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
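A script for generating such random lists could look roughly like the following Python sketch. The field name `ingredients`, the vocabulary of placeholder items, and the list sizes are all assumptions; the resulting JSON array of documents is the kind of payload a Solr update handler can typically ingest, but the indexing step itself is left out.

```python
import json
import random


def make_random_docs(num_docs, vocab_size=5000, min_items=5, max_items=50, seed=42):
    """Generate documents that each hold a random list of distinct items."""
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    vocabulary = ["item{}".format(i) for i in range(vocab_size)]
    docs = []
    for doc_id in range(num_docs):
        size = rng.randint(min_items, max_items)
        # sample() draws without replacement, so a list never repeats an item.
        docs.append({"id": str(doc_id), "ingredients": rng.sample(vocabulary, size)})
    return docs


# Small batch for illustration; raise num_docs toward a million for a real test.
docs = make_random_docs(1000)
payload = json.dumps(docs)
```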