I'm working with Elasticsearch 7.17.1 to analyze Thai text. My goal is to tokenize the text while retaining punctuation as separate tokens. The challenge is that most Elasticsearch analyzers, including the thai tokenizer, discard punctuation by default, and I haven't found a way to configure them to do otherwise.
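For example, a quick check with the _analyze API shows that the built-in thai tokenizer on its own already strips the parentheses and returns only the words เปิด, ไม่ and เป็น:

POST /_analyze
{
  "tokenizer": "thai",
  "text": "(เปิด) ไม่ เป็น"
}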
I attempted to create a custom analyzer in hopes of achieving this, but so far, I've had no success. Below is my latest attempt:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([\\p{Punct}])"
          ]
        }
      }
    }
  }
}
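For reference, I apply these settings when creating the index (the name my_thai_index matches the _analyze call below), along these lines:

PUT /my_thai_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["([\\p{Punct}])"]
        }
      }
    }
  }
}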
When analyzing text with the custom analyzer:
POST /my_thai_index/_analyze
{
  "analyzer": "thai_with_punctuation",
  "text": "(เปิด) ไม่ เป็น??? ??lol."
}
The response contains no punctuation tokens:
{
  "tokens": [
    {
      "token": "เปิด",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "ไม่",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "เป็น",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "lol",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 3
    }
  ]
}
Retaining punctuation is crucial for my application: another part of the system adjusts its behaviour based on the punctuation in the text.
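To make the requirement concrete, the token stream I'm after would look roughly like this (illustrative only; whether the question marks come out as one token or several doesn't matter much):

["(", "เปิด", ")", "ไม่", "เป็น", "???", "??", "lol", "."]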
Is there a workaround or a different approach to achieve this without creating a custom Elasticsearch plugin?