Elasticsearch - Default token in Analyzer if emitted tokens are empty


The requirement: when firing a search query on an Elasticsearch index, we build one large Query DSL query and run it against the index.

A custom analyzer is applied to one of the fields.

One of the filters in the analyzer is a synonym translator, which produces one of four expected tokens based on the input tokens.

If nothing matches, is there a way to test whether the emitted tokens are empty, and only then emit "typeA" as the default output token?

If the input text has no matching synonym, I get an empty token array. But this value goes into a must[] block (i.e. CustomTypeAnalyzer is applied to a field that is used in a must clause of the query).

So it cannot be empty: I need the analyzer to emit a default token. If my chosen default token is 'typeA', the analyzer should emit it only if CustomTranslator did not yield any tokens.

How can I do it?

I tried the conditional filter, but that too seems to operate only on input tokens.

I basically need to be able to test whether the emitted tokens are null (empty).

I cannot run multiple queries (calling the analyze API first and then the search API is not feasible); my query builder runs and fires a single search query on the index.

So I want my custom analyzer itself to conditionally emit a default token when it would otherwise emit no tokens.

Regards, Sumanth

My custom analyzer looks like this:

"MyCustomTypeAnalyzer": {
    "filter": [
        "EngStopWordsRemover",
        "lowercase",
        "EngStemmer",
        "CustomTranslator",
        "SpaceTokenRemover",
        "unique"
    ],
    "tokenizer": "standard"
}

The custom translator is something like this:

"CustomTranslator": {
    "type": "synonym",
    "synonyms": [
        "volunt,voluntari,chip,regist => typeA",
        "program,event,particip,partak,regist => typeB",
        "contribut,donat,donor,give,provid,sponsor,fund,raise,money,support => typeC",
        "enlist,associat,enrol,join,member,membership,regist => typeD"
    ]
}

  1. The input string "I would like to spend some time teaching kids" yields no tokens.

  2. The input string "I would like to join a teaching activity" emits the token "typeD".

I want to test the number of tokens emitted by the analyzer and conditionally emit the default token typeA, but this must be built into the analyzer itself.
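The behaviour being asked for can be modelled in a few lines of Python. This is only a sketch of the intended semantics, not Elasticsearch code: the names SYNONYMS, translate and analyze_with_default are hypothetical, the synonym table is abbreviated from the CustomTranslator rules, and the input lists are assumed to be already tokenized, lowercased, stemmed and stripped of stop words (the stems shown are guesses, not real analyzer output).

```python
# Abbreviated, hypothetical copy of the CustomTranslator rules (stem -> type).
SYNONYMS = {
    "volunt": "typeA",
    "program": "typeB",
    "donat": "typeC",
    "join": "typeD",
    "member": "typeD",
}

DEFAULT_TOKEN = "typeA"


def translate(tokens):
    """Keep only the tokens that have a synonym rule, mapped to their type."""
    return [SYNONYMS[t] for t in tokens if t in SYNONYMS]


def analyze_with_default(tokens):
    """Emit the mapped types, or the default token if nothing was mapped."""
    mapped = list(dict.fromkeys(translate(tokens)))  # de-duplicate, keep order
    return mapped if mapped else [DEFAULT_TOKEN]


# Assumed stems for the two example sentences.
print(analyze_with_default(["spend", "time", "teach", "kid"]))  # ['typeA']
print(analyze_with_default(["join", "teach", "activ"]))         # ['typeD']
```

The key point is that the default is decided once per analyzed text, after all tokens have been seen, rather than once per token.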

There is 1 answer

G0l0s

Your question isn't entirely clear, but I hope this solution fits your needs.

Below is a mapping with your synonym filter and a pattern_replace filter of mine that replaces every token the synonym filter left unchanged with the default token 'typeA':

PUT /token_replace
{
    "settings": {
        "analysis": {
            "analyzer": {
                "replacing_token_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "typead_synonym_filter",
                        "conditional_replacer_filter"
                    ]
                }
            },
            "filter": {
                "typead_synonym_filter": {
                    "type": "synonym",
                    "synonyms": [
                        "volunt,voluntari,chip,regist => typeA",
                        "program,event,particip,partak,regist => typeB",
                        "contribut,donat,donor,give,provid,sponsor,fund,raise,money,support => typeC",
                        "enlist,associat,enrol,join,member,membership,regist => typeD"
                    ]
                },
                "conditional_replacer_filter": {
                    "type": "pattern_replace",
                    "pattern": "(.++)(?<!\\btype[A-D])",
                    "replacement": "typeA"
                }
            }
        }
    }
}

The synonym rule for typeA is redundant here, because every token that the synonym filter leaves unchanged is replaced with 'typeA' anyway.
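To make the effect of the pattern filter easier to see, here is a rough Python model of it. It does not reuse the Java regular expression from the mapping; it only mimics the outcome under the assumption that every token is a single word, so a token is kept when it is exactly typeA to typeD and replaced otherwise. The names TYPE_TOKEN and replace_unmapped are hypothetical, and the input list is an assumed token stream after the synonym filter.

```python
import re

# Matches a token that is exactly one of the four type tokens.
TYPE_TOKEN = re.compile(r"type[A-D]")


def replace_unmapped(tokens, default="typeA"):
    """Replace every token that is not already a type token with the default."""
    return [t if TYPE_TOKEN.fullmatch(t) else default for t in tokens]


# Assumed token stream after the synonym filter for the text
# "I would like to join a teaching activity" (only "join" was rewritten).
after_synonyms = ["I", "would", "like", "to", "typeD", "a", "teaching", "activity"]
print(replace_unmapped(after_synonyms))
```

Note that the replacement is applied token by token, so the default appears once for each unmatched token rather than once for an otherwise empty token stream.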

Your first text

GET /token_replace/_analyze?filter_path=tokens.token
{
    "analyzer": "replacing_token_analyzer",
    "text": "I would like to spend some time teaching kids"
}

Response

{
    "tokens" : [
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        }
    ]
}

Your second text

GET /token_replace/_analyze?filter_path=tokens.token
{
    "analyzer": "replacing_token_analyzer",
    "text": "I would like to join a teaching activity"
}

Response

{
    "tokens" : [
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeD"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        },
        {
            "token" : "typeA"
        }
    ]
}