How to stop storing special characters in content while indexing

Question

500 views Asked by VenkateshPamulapati At 16 October 2020 at 07:16

This is a sample document with the following points: Pharmaceutical Marketing Building â€“ responsibilities.Â Â Mass. â€“ Aug. 13, 2020 â€“Â

How can I remove special characters (non-ASCII Unicode characters) such as the ones above from the content while indexing? I'm using Elasticsearch 7.x and StormCrawler 1.17.

There are 2 answers

Amit On 16 October 2020 at 07:40

If writing a custom parse filter and normalization step looks difficult, you can simply add the asciifolding token filter to your analyzer definition. It converts non-ASCII characters to their ASCII equivalents where one exists, as shown below with the _analyze API:

POST http://{{hostname}}:{{port}}/_analyze

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ"
}

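These are the tokens generated for your text. Note that the stray characters are folded to "a" and "A" rather than dropped, so they still appear as tokens: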

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}
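To apply the filter at index time rather than only through _analyze, the analyzer has to be declared in the index settings and attached to a field. The following is a minimal sketch, not a definitive setup: the analyzer name `folding_analyzer` and the field name `content` are assumptions chosen for illustration, and the script only builds and prints the JSON body that you would send when creating the index.

```python
import json


def build_index_body(field="content", analyzer_name="folding_analyzer"):
    """Build an index-creation body with a custom analyzer using asciifolding.

    Both `field` and `analyzer_name` are hypothetical names; adjust them
    to match the mapping your indexer actually writes to.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Same tokenizer and filter as the _analyze request above.
                        # You may also want "lowercase" here, since a custom
                        # analyzer does not lowercase unless told to.
                        "tokenizer": "standard",
                        "filter": ["asciifolding"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer_name}
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body to send in the index-creation request.
    print(json.dumps(build_index_body(), indent=4))
```

Keep in mind that an analyzer only changes the tokens written to the inverted index. The stored document source still contains the original characters, so this helps with searching but does not clean what is stored or displayed.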

**Julien Nioche** · Accepted Answer · 2020-10-16T07:32:51+00:00

This looks like an incorrect detection of the charset. You could normalise the content before indexing by writing a custom parse filter and removing the unwanted characters there.
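The normalisation step itself can be sketched as follows. This is only an illustration of the text transformation, written in Python for brevity; in StormCrawler the same logic would need to live in a Java parse filter. The function names are hypothetical, and `repair_mojibake` assumes the page was UTF-8 that got decoded as Windows-1252, which is a guess based on the look of the sample text rather than something the question confirms.

```python
import re

# One or more consecutive characters outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def repair_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: the wrong charset was cp1252. If the round trip fails,
    the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both encode and decode failures.
        return text


def strip_non_ascii(text: str) -> str:
    """Replace runs of non-ASCII characters with a space, then tidy whitespace."""
    without = _NON_ASCII.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()


if __name__ == "__main__":
    original = "Building \u2013 responsibilities.\u00a0Mass."
    # Simulate the wrong charset detection.
    garbled = original.encode("utf-8").decode("cp1252")
    print(repair_mojibake(garbled) == original)
    print(strip_non_ascii(garbled))
```

Repairing keeps legitimate characters such as dashes, while stripping is blunter and discards every non-ASCII character, including valid accented letters. Which one fits depends on whether the content is expected to be ASCII only.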
