Creating a custom tokenizer in ElasticSearch NEST


I have a custom class indexed in ES 2.5 with the following fields:

Title
DataSources
Content

Running a search is fine, except with the middle field - it's built/indexed using a delimiter of '|'.

ex: "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"

I need to build a query that matches some words across all fields AND matches at least one number in the DataSources field.

So to summarize what I currently have:

    QueryBase query = new SimpleQueryStringQuery
    {
        //DefaultOperator = !operatorOR ? Operator.And : Operator.Or,
        Fields = LearnAboutFields.FULLTEXT,
        Analyzer = "standard",
        Query = searchWords.ToLower()
    };
    _boolQuery.Must = new QueryContainer[] {query};

That's the search words query.

    foreach (var datasource in dataSources)
    {
        // Add DataSources with an OR
        queryContainer |= new WildcardQuery { Field = LearnAboutFields.DATASOURCE, Value = string.Format("*{0}*", datasource) };
    }
    // Add this Boolean Clause to our outer clause with an AND
    _boolQuery.Filter = new QueryContainer[] {queryContainer};

That's for the datasources query. There can be multiple datasources.

It doesn't work: it returns no results once the filter query is added. I think the tokenizer/analyzer needs some work, but I don't know enough about ES to figure that out.

EDIT: Per Val's comments below I have attempted to recode the indexer like this:

        _elasticClientWrapper.CreateIndex(_DataSource, i => i
            .Mappings(ms => ms
                .Map<LearnAboutContent>(m => m
                    .Properties(p => p
                        .String(s => s.Name(lac => lac.DataSources)
                            .Analyzer("classic_tokenizer")
                            .SearchAnalyzer("standard")))))
            .Settings(s => s
                .Analysis(an => an.Analyzers(a => a.Custom("classic_tokenizer", ca => ca.Tokenizer("classic"))))));
        var indexResponse = _elasticClientWrapper.IndexMany(contentList);

It builds successfully, with data. However the query still isn't working right.

New query for DataSources:

        foreach (var datasource in dataSources)
        {
            // Add DataSources with an OR
            queryContainer |= new TermQuery {Field = LearnAboutFields.DATASOURCE, Value = datasource};
        }
        // Add this Boolean Clause to our outer clause with an AND
        _boolQuery.Must = new QueryContainer[] {queryContainer};

And the JSON:

{
  "learnabout_index": {
    "aliases": {},
    "mappings": {
      "learnaboutcontent": {
        "properties": {
          "articleID": { "type": "string" },
          "content": { "type": "string" },
          "dataSources": {
            "type": "string",
            "analyzer": "classic_tokenizer",
            "search_analyzer": "standard"
          },
          "description": { "type": "string" },
          "fileName": { "type": "string" },
          "keywords": { "type": "string" },
          "linkURL": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1483992041623",
        "analysis": {
          "analyzer": {
            "classic_tokenizer": { "type": "custom", "tokenizer": "classic" }
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "iZakEjBlRiGfNvaFn-yG-w",
        "version": { "created": "2040099" }
      }
    },
    "warmers": {}
  }
}

The Query JSON request:

{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "_all"
            ],
            "query": "\"housing\"",
            "analyzer": "standard"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "DataSources": [
              "1"
            ]
          }
        }
      ]
    }
  }
}

There are 2 answers

Val (best answer)

One way to achieve this is to create a custom analyzer with a classic tokenizer which will break your DataSources field into the numbers composing it, i.e. it will tokenize the field on each | character.

So when you create your index, you need to add this custom analyzer and then use it in your DataSources field:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "number_analyzer": {
          "type": "custom",
          "tokenizer": "number_tokenizer"
        }
      },
      "tokenizer": {
        "number_tokenizer": {
          "type": "classic"
        }
      }
    }
  },
  "mappings": { 
    "my_type": {
      "properties": {
        "DataSources": {
          "type": "string",
          "analyzer": "number_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

As a result, if you index the string "|4|7|8|9|10|12|14|19|20|21|22|23|29|30", your DataSources field will effectively contain the following array of tokens: [4, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 23, 29, 30]
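
To make the tokenization step concrete, here is a minimal Python sketch of what happens to the sample string at index time. It is a toy model, not Elasticsearch itself: the function `classic_like_tokenize` is a hypothetical stand-in that assumes the tokenizer simply splits on any run of non-alphanumeric characters, which is a simplification of the real classic tokenizer.

```python
import re


def classic_like_tokenize(text):
    """Toy stand-in for the tokenizer (assumption: split on runs of
    non-alphanumeric characters and drop empty pieces)."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


raw = "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"
tokens = classic_like_tokenize(raw)

# Each number becomes its own token; the '|' delimiters are discarded.
print(tokens)
```

Under that assumption, each pipe-delimited number ends up as a separate token, and those tokens are what an exact term lookup is compared against.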

Then you can get rid of your WildcardQuery and simply use a TermsQuery instead:

var terms = new TermsQuery { Field = LearnAboutFields.DATASOURCE, Terms = dataSources };
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] { terms };
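
The difference between the two query styles can be sketched with a small Python toy model. It assumes the field has already been split into one token per number; `terms_match` and `wildcard_match` are hypothetical helpers that imitate exact-term lookup and wildcard pattern matching over those tokens, not the Elasticsearch implementation.

```python
from fnmatch import fnmatchcase

# Tokens assumed to be stored for the sample document.
doc_tokens = ["4", "7", "8", "9", "10", "12", "14", "19",
              "20", "21", "22", "23", "29", "30"]


def terms_match(tokens, wanted):
    """Terms-style lookup: a hit needs at least one exact token match."""
    wanted_set = set(wanted)
    return any(tok in wanted_set for tok in tokens)


def wildcard_match(tokens, pattern):
    """Wildcard-style lookup: a hit needs one token matching the pattern."""
    return any(fnmatchcase(tok, pattern) for tok in tokens)


print(terms_match(doc_tokens, ["1"]))        # False: there is no token "1"
print(wildcard_match(doc_tokens, "*1*"))     # True: "10", "12", "14", ... contain "1"
print(terms_match(doc_tokens, ["1", "19"]))  # True: "19" matches exactly
```

In this model a `*1*` pattern reports a hit for data source 1 even though the document only lists 10, 12, 14 and so on, while the exact lookup does not, which is one reason to prefer the terms approach once the field is tokenized per number.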

GWilkinson

At first glance, I think one problem you might have is that any queries placed within a filter clause will not be analysed. The value will not be broken down into tokens and will instead be compared in its entirety.

This is easy to forget, so any values that require analysis need to be placed in the must or should clauses.