We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster.
Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed it into a query_string
query.
We do support some text-related OData filter expressions, such as:
startswith(field,'bla')
endswith(field,'bla')
substringof('bla',field)
name eq 'bla'
The fields we're matching against can be analyzed
, not_analyzed
or both (i.e. via a multi-field).
The searched text can be a single token (e.g. table
), only a part thereof (e.g. tab
), or several tokens (e.g. table 1.
, table 10
, etc).
The search must be case-insensitive.
Here are some examples of the behavior we need to support:
startswith(name,'table 1')
must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"endswith(name,'table 1')
must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"substringof('table 1',name)
must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"name eq 'table 1'
must match "Table 1", "TABLE 1", "table 1"
So basically, we take the user input (i.e. what is passed into the 2nd parameter of startswith
/endswith
, resp. the 1st parameter of substringof
, resp. the right-hand side value of the eq
) and try to match it exactly, whether the tokens fully match or only partially.
Right now, we're getting away with a clumsy solution highlighted below which works pretty well, but is far from being ideal.
In our query_string
, we match against a not_analyzed
field using the Regular Expression syntax. Since the field is not_analyzed
and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query in order to come up with something like this, i.e. this is equivalent to the OData filter endswith(name,'table 8')
(=> match all documents whose name
ends with "table 8")
"query": {
"query_string": {
"query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
"lowercase_expanded_terms": false,
"analyze_wildcard": true
}
}
So, even though, this solution works pretty well and the performance is not too bad (which came out as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden at indexing time instead of searching time. However, since reindexing all our data will take weeks, we'd like to first investigate if there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.
My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. to match at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together in order to match several tokens, where one of the tokens might not be fully input by the user). For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name)
to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
Thanks in advance for sharing your braincells on this one.
UPDATE 1: after testing Andrei's solution
=> Exact match (eq
) and startswith
work perfectly.
A. endswith
glitches
Searching for substringof('table 112', name)
yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112')
yields 1525 docs, while it should yield less docs (suffix matches should be a subset of substring matches). Checking in more depth I've found some mismatches, such as "Social Club, Table 12" (doesn't contain "112") or "Order 312" (contains neither "table" nor "112"). I guess it's because they end with "12" and that's a valid gram for the token "112", hence the match.
B. substringof
glitches
Searching for substringof('table',name)
matches "Party table", "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name)
doesn't match anything
UPDATE 2
It was sort of implied but I forgot to explicitely mention that the solution will have to work with the query_string
query, mainly due to the fact that the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalent. I'm aware that we're trading off the power of the Elasticsearch Query DSL with the Lucene's query syntax, which is a bit less powerful and less expressive, but that's something that we can't really change. We're pretty d**n close, though!
UPDATE 3 (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type
that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
This is an interesting use case. Here's my take:
my_ngram_analyzer
is used to split every text into small pieces, how large the pieces are depends on your use case. I chose, for testing purposes, 25 chars.lowercase
is used since you said case-insensitive. Basically, this is the tokenizer used forsubstringof('table 1',name)
. The query is simple:my_edge_ngram_analyzer
is used to split the text starting from the beginning and this is specifically used for thestartswith(name,'table 1')
use case. Again, the query is simple:endswith(name,'table 1')
. For this I definedmy_reverse_edge_ngram_analyzer
which uses akeyword
tokenizer together withlowercase
and anedgeNGram
filter preceded and followed by areverse
filter. What this tokenizer basically does is to split the text in edgeNGrams but the edge is the end of the text, not the start (like with the regularedgeNGram
). The query:name eq 'table 1'
case, a simplekeyword
tokenizer together with alowercase
filter should do it The query:Regarding
query_string
, this changes the solution a bit, because I was counting onterm
to not analyze the input text and to match it exactly with one of the terms in the index.But this can be "simulated" with
query_string
if the appropriateanalyzer
is specified for it.The solution would be a set of queries like the following (always use that analyzer, changing only the field name):