Elasticsearch aggregation on part of string, not full string

I'm trying to aggregate on the second-level categories of a hierarchically stored string. The problem is that the depth of the hierarchy varies: one product category could have six levels and another only four. Otherwise I would have just implemented predefined levels.

I have some products with categories like so:

[
  {
    title: 'product one',
    categories: [
      'clothing/mens/shoes/boots/steel-toe'
    ]
  },
  {
    title: 'product two',
    categories: [
      'clothing/womens/tops/sweaters/open-neck'
    ]
  },
  {
    title: 'product three',
    categories: [
      'clothing/kids/shoes/sneakers/light-up'
    ]
  },
  {
    title: 'product etc.',
    categories: [
      'clothing/baby/bibs/super-hero'
    ]
  }, 
  ... more products
]

I'm trying to get aggregation buckets like so:

buckets: [
  {
    key: 'clothing/mens',
    ...
  },
  {
    key: 'clothing/womens',
    ...
  },
  {
    key: 'clothing/kids',
    ...
  },
  {
    key: 'clothing/baby',
    ...
  },
]

I've looked at prefix filters and at includes and excludes on the terms aggregation, but I can't find anything that works. Could someone please point me in the right direction?

Answer by Andrei Stefan (best answer)

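Your category field should be analyzed with a custom analyzer that keeps only the first two path segments. The keyword tokenizer emits the whole string as a single token, and the pattern_capture filter (with preserve_original set to false) replaces that token with the captured prefix. You may have other plans for the category field itself, so I'll just add a subfield used only for aggregations: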

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "category_trimming": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [
            "(^\\w+\/\\w+)"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "category_trimming",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "category": {
          "type": "string",
          "fields": {
            "just_for_aggregations": {
              "type": "string",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
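
To see what this analyzer chain would emit without running a cluster, here is a minimal Python sketch that mimics it for a single value. It is an approximation, not Elasticsearch itself: the function name trim_category is hypothetical, and the fallback to the original token when the pattern does not match is an assumption about how pattern_capture behaves.

```python
import re

# Same pattern as the "category_trimming" filter: the first two path segments.
PATTERN = re.compile(r"(^\w+/\w+)")


def trim_category(value: str) -> str:
    """Approximate keyword tokenizer -> pattern_capture -> lowercase."""
    # The keyword tokenizer leaves the whole string as one token.
    token = value
    match = PATTERN.search(token)
    # Assumption: if the pattern does not match, the original token is kept.
    if match:
        token = match.group(1)
    # The lowercase filter runs last.
    return token.lower()


if __name__ == "__main__":
    samples = [
        "clothing/mens/shoes/boots/steel-toe",
        "clothing/baby/bibs/super-hero",
        "clothing/t-shirts/graphic",  # hyphen in the second segment
    ]
    for sample in samples:
        print(sample, "->", trim_category(sample))
```

One thing the sketch makes visible: \w does not match a hyphen, so a second-level name such as t-shirts would be cut at the hyphen. If your second-level names can contain hyphens, a wider pattern such as (^[^/]+/[^/]+) may be a better fit (untested suggestion).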

Test data:

POST /index/test/_bulk
{"index":{}}
{"category": "clothing/womens/tops/sweaters/open-neck"}
{"index":{}}
{"category": "clothing/mens/shoes/boots/steel-toe"}
{"index":{}}
{"category": "clothing/kids/shoes/sneakers/light-up"}
{"index":{}}
{"category": "clothing/baby/bibs/super-hero"}

The query itself:

GET /index/test/_search?search_type=count
{
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.just_for_aggregations",
        "size": 10
      }
    }
  }
}
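
The requests above use the syntax of older Elasticsearch releases: the string field type, a mapping type named test, and search_type=count. On newer releases the string type is split into text and keyword, mapping types are gone, and "size": 0 takes the place of search_type=count. As a hedged sketch only (not part of the original answer and not verified against a cluster), the equivalent request bodies might be built like this in Python; the variable names are hypothetical, and fielddata is enabled on the assumption that a terms aggregation on a text subfield requires it.

```python
import json

# Same regular expression as in the original "category_trimming" filter.
CATEGORY_PATTERN = r"(^\w+/\w+)"

# Body for index creation (PUT /index), assuming a typeless mapping.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "category_trimming": {
                    "type": "pattern_capture",
                    "preserve_original": False,
                    "patterns": [CATEGORY_PATTERN],
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "keyword",
                    "filter": ["category_trimming", "lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "category": {
                "type": "text",
                "fields": {
                    "just_for_aggregations": {
                        "type": "text",
                        "analyzer": "my_analyzer",
                        # Assumption: needed to aggregate on a text field.
                        "fielddata": True,
                    }
                },
            }
        }
    },
}

# Body for the search (POST /index/_search); "size": 0 returns only aggregations.
search_body = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category.just_for_aggregations", "size": 10}
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
    print(json.dumps(search_body, indent=2))
```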

The results:

   "aggregations": {
      "by_category": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "clothing/baby",
               "doc_count": 1
            },
            {
               "key": "clothing/kids",
               "doc_count": 1
            },
            {
               "key": "clothing/mens",
               "doc_count": 1
            },
            {
               "key": "clothing/womens",
               "doc_count": 1
            }
         ]
      }
   }
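
As a sanity check on these buckets, the following Python sketch reproduces the same counting over the test data without Elasticsearch. It is only an approximation of the terms aggregation: the helper names are hypothetical, and the ordering (highest doc_count first, ties broken by key) is an assumption about the default bucket order.

```python
import re
from collections import Counter

# Same pattern as the "category_trimming" filter.
PATTERN = re.compile(r"(^\w+/\w+)")


def second_level(value):
    """Return the lowercased first two path segments of a category string."""
    match = PATTERN.search(value)
    return (match.group(1) if match else value).lower()


def terms_buckets(docs, size=10):
    """Count documents per second-level category, like a terms aggregation."""
    counts = Counter()
    for doc in docs:
        values = doc["category"]
        if isinstance(values, str):
            values = [values]
        # A document counts once per distinct key, even with several categories.
        for key in {second_level(v) for v in values}:
            counts[key] += 1
    # Assumed order: doc_count descending, then key ascending.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


DOCS = [
    {"category": "clothing/womens/tops/sweaters/open-neck"},
    {"category": "clothing/mens/shoes/boots/steel-toe"},
    {"category": "clothing/kids/shoes/sneakers/light-up"},
    {"category": "clothing/baby/bibs/super-hero"},
]

if __name__ == "__main__":
    for bucket in terms_buckets(DOCS):
        print(bucket)
```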