How to stop storing special characters in content while indexing

481 views Asked by At

This is a sample document with the following points: Pharmaceutical Marketing Building – responsibilities.  Mass. – Aug. 13, 2020 –Â

How to remove the special characters or non ascii unicode chars from content while indexing? I'm using ES 7.x and storm crawler 1.17

2

There are 2 answers

0
Julien Nioche On BEST ANSWER

Looks like an incorrect detection of charset. You could normalise the content before indexing by writing a custom parse filter and remove the unwanted characters there.

2
Amit On

If writing a custom parse filter and normalization looks difficult for you. you can simply add the asciifolding token filter in your analyzer definition which would convert the non-ascii char to their ascii char as shown below

POST http://{{hostname}}:{{port}}/_analyze

{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}

And generated tokens for your text.

{
    "tokens": [
        {
            "token": "Pharmaceutical",
            "start_offset": 0,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Marketing",
            "start_offset": 15,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Building",
            "start_offset": 25,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 34,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "responsibilities.A",
            "start_offset": 36,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "A",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "Mass",
            "start_offset": 57,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "a",
            "start_offset": 63,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "Aug",
            "start_offset": 65,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "13",
            "start_offset": 70,
            "end_offset": 72,
            "type": "<NUM>",
            "position": 9
        },
        {
            "token": "2020",
            "start_offset": 74,
            "end_offset": 78,
            "type": "<NUM>",
            "position": 10
        },
        {
            "token": "aA",
            "start_offset": 79,
            "end_offset": 81,
            "type": "<ALPHANUM>",
            "position": 11
        }
    ]
}