I'm working with Elasticsearch 7.17.1 to analyze Thai text. My goal is to tokenize the text while retaining punctuation as separate tokens. The challenge is that most Elasticsearch analyzers, including the thai tokenizer, discard punctuation by default, and I haven't found a way to configure them to do otherwise.
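For example, a quick check with the _analyze API shows that the built-in thai tokenizer on its own already strips the parentheses and returns only the words เปิด, ไม่ and เป็น:

POST /_analyze
{
  "tokenizer": "thai",
  "text": "(เปิด) ไม่ เป็น"
}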
I attempted to create a custom analyzer in hopes of achieving this, but so far, I've had no success. Below is my latest attempt:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([\\p{Punct}])"
          ]
        }
      }
    }
  }
}
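For reference, I apply these settings when creating the index (the name my_thai_index matches the _analyze call below), along these lines:

PUT /my_thai_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["([\\p{Punct}])"]
        }
      }
    }
  }
}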
When analyzing text with the custom analyzer:
POST /my_thai_index/_analyze
{
  "analyzer": "thai_with_punctuation",
  "text": "(เปิด) ไม่ เป็น??? ??lol."
}
The response contains no punctuation tokens:
{
  "tokens": [
    {
      "token": "เปิด",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "ไม่",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "เป็น",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "lol",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 3
    }
  ]
}
Retaining punctuation is crucial for my application: another part of the system adjusts its behaviour based on the punctuation in the text.
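To make the requirement concrete, the token stream I'm after would look roughly like this (illustrative only; whether the question marks come out as one token or several doesn't matter much):

["(", "เปิด", ")", "ไม่", "เป็น", "???", "??", "lol", "."]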
Is there a workaround or a different approach to achieve this without creating a custom Elasticsearch plugin?