I'm using the Elasticsearch term suggester for spell correction. My index contains a huge list of ads, and each ad has subject and body fields. I've found a problematic example for which the suggester does not return the correct suggestion.

I have lots of ads whose subject contains the word "soffa" and only 5 ads whose subject contains the word "sofa". Ideally, when I send "sofa" (wrong spelling) as the text to the suggester, it should return "soffa" (correct spelling) as a suggestion, since "soffa" is the correct spelling and appears in most of the ads, while only a few ads contain the misspelled "sofa".

Here is my suggester query body:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject",
        "suggest_mode": "popular",
        "min_word_length": 1
      }
    }
  }
}

When I send the above query, I get the following response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sof",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soff",
                        "score": 0.6666666,
                        "freq": 298
                    },
                    {
                        "text": "sol",
                        "score": 0.6666666,
                        "freq": 101
                    },
                    {
                        "text": "saf",
                        "score": 0.6666666,
                        "freq": 6
                    }
                ]
            }
        ]
    }
}

As you can see in the above response, it returned "soff" but not "soffa", although I have lots of docs whose subject contains "soffa".

I even played with parameters like suggest_mode and string_distance, but still no luck.

I also tried the phrase suggester instead of the term suggester, but the result was the same. Here is my phrase suggester query:

{
    "suggest": {
        "text": "sofa",
        "subjectSuggester": {
            "phrase": {
                "field": "subject",
                "size": 10,
                "gram_size": 3,
                "direct_generator": [
                    {
                        "field": "subject.trigram",
                        "suggest_mode": "always",
                        "min_word_length":1
                    }
                ]
            }
        }
    }
}

I suspect it doesn't work when a character is missing rather than misplaced. In the "soffa" example, one "f" is missing, whereas it works fine for other misspellings: for example, when I send "vovlo" it gives me "volvo".
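One way to sanity-check this hypothesis outside Elasticsearch is to compute the edit distances directly. The sketch below is a plain-Python optimal string alignment distance (insert, delete, substitute, adjacent transposition). It is an illustrative stand-in, not the suggester's actual implementation, and the function name osa_distance is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "vl" <-> "lv"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    for typed, wanted in [("sofa", "soffa"), ("vovlo", "volvo"), ("sof", "soff")]:
        print(typed, wanted, osa_distance(typed, wanted))
```

Under this measure a missing character costs one edit, the same as the swapped letters in "vovlo", so both fall within the term suggester's default max_edits of 2. A missing character alone therefore does not obviously explain the difference, and it may be worth comparing the text the response says it is correcting ("sof") with the text that was sent ("sofa").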

Any help would be hugely appreciated.

There are 3 answers

1
rabbitbr On

Try changing the "string_distance".

{
  "suggest": {
    "text": "sof",
    "subjectSuggester": {
      "term": {
        "field": "title",
        "min_word_length":2,
        "string_distance":"ngram"
      }
    }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#term-suggester

0
Code_Worm On

I found a workaround myself. I added a shingle filter with max_shingle_size 3 (i.e. word trigrams) and a custom analyzer named trigram that uses it, then added a subfield with that analyzer and ran the suggester query against that subfield instead of the actual field, and it worked.

Here are the mapping changes:

{
    "settings": {
        "analysis": {
            "filter": {
                "shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3
                }
            },
            "analyzer": {
                "trigram": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "shingle"
                    ],
                    "char_filter": [
                        "diacritical_marks_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "subject": {
                "type": "text",
                "fields": {
                    "trigram": {
                        "type": "text",
                        "analyzer": "trigram"
                    }
                }
            }
        }
    }
}
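For illustration, here is a rough Python sketch of the tokens a shingle filter with these sizes would emit for a short subject. It is a simplified imitation, not Elasticsearch's implementation: the shingles helper and the sample subject are invented for this example, it assumes the filter's default of also emitting single words (output_unigrams), and it ignores the diacritical_marks_filter char filter, whose definition is not shown above.

```python
from typing import List


def shingles(tokens: List[str], min_size: int = 2, max_size: int = 3,
             output_unigrams: bool = True) -> List[str]:
    """Return word n-grams (shingles) of min_size..max_size words.

    With output_unigrams=True the single words are kept as well,
    mirroring the shingle filter's default behaviour.
    """
    out: List[str] = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    # Hypothetical ad subject, already split on whitespace and lowercased.
    subject = "Stor brun soffa".lower().split()
    for token in shingles(subject):
        print(token)
```

Under these assumptions the subfield still indexes the whole word "soffa" as a term of its own, lowercased but not stemmed, which would be consistent with the suggester being able to return it.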

And here is my corrected query:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject.trigram",
        "suggest_mode": "popular",
        "min_word_length": 1,
        "string_distance": "ngram"
      }
    }
  }
}

Note that I'm running the suggester against subject.trigram instead of subject itself.

Here is the result:

{
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sofa",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soffa",
                        "score": 0.8,
                        "freq": 282
                    },
                    {
                        "text": "soffan",
                        "score": 0.6666666,
                        "freq": 5
                    },
                    {
                        "text": "som",
                        "score": 0.625,
                        "freq": 102
                    },
                    {
                        "text": "sol",
                        "score": 0.625,
                        "freq": 82
                    },
                    {
                        "text": "sony",
                        "score": 0.625,
                        "freq": 50
                    }
                ]
            }
        ]
    }
}

As you can see above, soffa appears as the first suggestion.

1
Talal Humaidi On

There is something odd in your term suggester result for the word sofa. Take a look at the text that is being corrected:

"suggest": {
    "subjectSuggester": [
        {
            "text": "sof",
            "offset": 0,
            "length": 4,
            "options": [
                {
                    "text": "soff",
                    "score": 0.6666666,
                    "freq": 298
                },
                {
                    "text": "sol",
                    "score": 0.6666666,
                    "freq": 101
                },
                {
                    "text": "saf",
                    "score": 0.6666666,
                    "freq": 6
                }
            ]
        }
    ]
}

As you can see, it's sof and not sofa, which means the correction is being computed for sof rather than for sofa. So I suspect this issue is related to the analyzer you are using on this field, especially since the results contain soff instead of soffa: it is removing the trailing a.
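A quick way to check this is to ask Elasticsearch which tokens the field's analyzer produces, using the _analyze API. The sketch below only uses Python's standard library. The base URL http://localhost:9200 and the index name ads are assumptions (neither is given in the question), and the helper names are made up for this example.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(base_url: str, index: str, field: str,
                          text: str) -> urllib.request.Request:
    """Build a POST /<index>/_analyze request that analyzes `text`
    with the analyzer configured for `field`."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyzed_tokens(base_url: str, index: str, field: str,
                    text: str) -> List[str]:
    """Send the request and return just the token strings."""
    request = build_analyze_request(base_url, index, field, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


if __name__ == "__main__":
    # Assumed cluster address and index name; adjust to your setup.
    for field in ("subject", "subject.trigram"):
        print(field, analyzed_tokens("http://localhost:9200", "ads", field, "sofa"))
```

If the subject field yields sof for the input sofa while a field without stemming yields sofa, that would point to a stemmer in the main field's analyzer as the reason the suggester compares sof against soff.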