Custom analyzer, use case: zip-code [Elasticsearch]


Consider an index/type named customers/customer. Each document in this set has a zip-code property. A zip-code can take one of these forms:

  • String-String (e.g. 8907-1009)
  • String String (e.g. 211-20)
  • String (e.g. 30200)

I'd like to configure my index analyzer so that searches match as many relevant documents as possible. Currently, I do it like this:

PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        },
        ... some string properties ...
      }
    }
  }
}

When I search for a document, I use this request:

GET /customers/customer/_search
{
  "query":{
    "prefix":{
      "zip-code":"211-20"
     }
   }
}

That works if you search for the exact value. But if the zip-code is "200 30", for instance, then searching for "200-30" returns no results. I'd like to configure my index analyzer so that I don't have this problem. Can someone help me? Thanks.

P.S. If you want more information, please let me know ;)

1 Answer

Answered by xeraa (accepted answer):

As soon as you want to find variations, you don't want to use not_analyzed.

Let's try this with a different mapping:

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

We're using the standard tokenizer; strings will be broken up into tokens at whitespace and punctuation marks (including dashes). You can see the actual tokens if you run the following query:

POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": ["8907-1009", "211-20", "30200"]
}
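The tokens the standard tokenizer emits here can be approximated in a few lines of Python (an illustration only, not Elasticsearch code; the real tokenizer implements Unicode text segmentation, but for zip codes like these, splitting on non-alphanumeric characters gives the same result):

```python
import re

def standard_tokens(text):
    # Rough approximation of the standard tokenizer for simple inputs:
    # split on any run of non-alphanumeric characters and drop empties.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(standard_tokens("8907-1009"))  # ['8907', '1009']
print(standard_tokens("211-20"))     # ['211', '20']
print(standard_tokens("30200"))      # ['30200']
```

Note that "200 30" and "200-30" both tokenize to ['200', '30'], which is exactly why the analyzed approach solves the question's problem.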

Add your examples:

POST zip/_doc
{
  "zip": "8907-1009"
}
POST zip/_doc
{
  "zip": "211-20"
}
POST zip/_doc
{
  "zip": "30200"
}

Now the query seems to work fine:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "211-20"
    }
  }
}

This will also work if you just search for "211". However, this might be too lenient, since it will also find "20", "20-211", "211-10",...

What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:

GET zip/_search
{
  "query": {
    "match_phrase": {
      "zip": "211-20"
    }
  }
}

Addition:

If the ZIP codes have a hierarchical meaning (if you have "211-20" you want this to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.

So changing the mapping to this:

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

Using the same 3 documents from above, you can now use the match query:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "1009"
    }
  }
}

"1009" won't find anything, but "8907" or "8907-1009" will.
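To see why, here is a small Python sketch (again just an illustration, not Elasticsearch code) of what the path_hierarchy tokenizer with delimiter "-" emits: every leading prefix of the value, so only leading components become searchable tokens:

```python
def path_hierarchy_tokens(text, delimiter="-"):
    # Emit every prefix of the delimited value, mirroring what the
    # path_hierarchy tokenizer with this delimiter produces.
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy_tokens("8907-1009"))  # ['8907', '8907-1009']
print(path_hierarchy_tokens("211-20"))     # ['211', '211-20']
print(path_hierarchy_tokens("30200"))      # ['30200']
```

Since "1009" is not among the indexed tokens for "8907-1009", a match query for it finds nothing.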

If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_hierarchical": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        },
        "zip_standard": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_standard",
          "fields": {
            "hierarchical": {
              "type": "text",
              "analyzer": "zip_hierarchical"
            }
          }
        }
      }
    }
  }
}

Add a document with the inverse order to properly test it:

POST zip/_doc
{
  "zip": "1009-111"
}

Then search both fields, but boost the one with the hierarchical tokenizer by 3:

GET zip/_search
{
  "query": {
    "multi_match" : {
      "query" : "1009",
      "fields" : [ "zip", "zip.hierarchical^3" ] 
    }
  }
}

Then you can see that "1009-111" has a much higher score than "8907-1009".
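Combining the two tokenizer sketches from above shows which field each document matches on (an illustration only; the field names mirror the mapping, but the actual scoring is done by Elasticsearch):

```python
import re

def standard_tokens(text):
    # Rough approximation of the standard tokenizer (see earlier sketch).
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def path_hierarchy_tokens(text, delimiter="-"):
    # Prefix tokens, mirroring the path_hierarchy tokenizer.
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def matching_fields(doc, query):
    # Which of the two mapped fields contain the query as a token?
    fields = []
    if query in standard_tokens(doc):
        fields.append("zip")
    if query in path_hierarchy_tokens(doc):
        fields.append("zip.hierarchical")  # boosted ^3 in the multi_match
    return fields

print(matching_fields("1009-111", "1009"))   # ['zip', 'zip.hierarchical']
print(matching_fields("8907-1009", "1009"))  # ['zip']
```

Because "1009-111" matches the boosted zip.hierarchical field as well as zip, while "8907-1009" only matches zip, the former ends up with the much higher score.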