Elasticsearch average over date histogram buckets

7.5k views Asked by At

I've got a bunch of documents indexed in ElasticSearch, and I need to get the following data:

For each month, get the average number of documents per working day of the month (or if impossible, use 20 days as the default).

I already aggregated my data into months buckets using the date histogram aggregation. I tried to nest a stats bucket, but this aggregations uses data extracted from the document's field, not from the parent bucket.

Here is my query so far:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "docs_per_month": {
            "date_histogram": {
                "field": "created_date",
                "interval": "month",
                "min_doc_count": 0
            }
            "aggs": {
                '???': '???'
            }
        }
    }
}

edit

To make my question clearer, what I need is:

  • Get the total of numbers of documents created for the month (which is already done thanks to the date_histogram aggregation)
  • Get the number of working days for the month
  • Divide the first by the second.

4

There are 4 answers

0
Andrei Stefan On BEST ANSWER

What you basically need is something like this (which doesn't work, as it's not an available feature):

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "average": {
          "avg": {
            "script": "doc_count / 20"
          }
        }
      }
    }
  }
}

It doesn't work because there is not way of accessing the doc_count from the "parent" aggregation.

But, this will be possible in the 2.x branch of Elasticsearch and, at the moment, it's being actively developed: https://github.com/elastic/elasticsearch/issues/8110 This new feature will add a second layer of manipulation over the results (buckets) of an aggregation and it's not only your usecase, but many others.

Unless you want to try some ideas out there or perform your own calculations in your app, you need to wait for this feature.

3
Julien C. On

You want to exclude documents with timestamp on Saturday and Sunday, so you can exclude those documents in your query using a script

{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc['@timestamp'].date.dayOfWeek != 7 && doc['@timestamp'].date.dayOfWeek != 6"
        }
      }
    }
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "created_date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "docs_per_day": {
          "date_histogram": {
            "field": "created_date",
            "interval": "day",
            "min_doc_count": 0
          }
        },
        "aggs": {
          "docs_count": {
            "avg": {
              "field": ""
            }
          }
        }
      }
    }
  }
}

You may not need the first aggregation by month, since you already have this information using day interval

BTW you need to make sure dynamic scripting is enabled by adding this to your elasticsearch.yml configuration

script.disable_dynamic: false

Or add a groovy script under /config/scripts and use a filtered query with a script in filter

0
Val On

There is a pretty convoluted solution and not really performant, using the following scripted_metric aggregation.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "created_date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_doc_per_biz_day": {
          "scripted_metric": {
            "init_script": "_agg.bizdays = []; _agg.allbizdays = [:]; start = new DateTime(1970, 1, 1, 0, 0); now = new DateTime(); while (start < now) { def end = start.plusMonths(1); _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; start = end; }",
            "map_script": "_agg.bizdays << _agg.allbizdays[doc. created_date.date.year+'_'+doc. created_date.date.monthOfYear]",
            "combine_script": "_agg.allbizdays = null; doc_count = 0; for (d in _agg.bizdays){ doc_count++ }; return doc_count / _agg.bizdays[0]",
            "reduce_script": "res = 0; for (a in _aggs) { res += a }; return res"
          }
        }
      }
    }
  }
}

Let's detail each script below.

What I'm doing in init_script is creating a map of the number of business days for each month since 1970 and storing that in the _agg.allbizdays map.

_agg.bizdays = [];
_agg.allbizdays = [:]; 
start = new DateTime(1970, 1, 1, 0, 0);
now = new DateTime();
while (start < now) { 
    def end = start.plusMonths(1);     
    _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; 
    start = end; 
}

In map_script, I'm simply retrieving the number of weekdays for the month of each document;

_agg.bizdays << _agg.allbizdays[doc.created_date.date.year + '_' + doc. created_date.date.monthOfYear];

In combine_script, I'm summing up the average doc count for each shard

_agg.allbizdays = null;
doc_count = 0; 
for (d in _agg.bizdays){ doc_count++ }; 
return doc_count / _agg.bizdays[0];

And finally in reduce_script, I'm summing up the average doc count for each node:

res = 0; 
for (a in _aggs) { res += a }; 
return res

Again I think it's pretty convoluted and as Andrei rightly said it, it is probably better to wait for 2.0 to make it work the way it should, but in the meantime you have this solution, if you need it.

2
dularion On

For anyone still interested, you can now do with with the avg_bucket aggregation. Its still a bit tricky, because you cannot simply run the avg_bucket on a date_historgram aggregation result, but with a secondary value_count aggregation with some unique value and it works fine :)

{
  "size": 0,
  "aggs": {
    "orders_per_day": {
      "date_histogram": {
        "field": "orderedDate",
        "interval": "day"
      },
      "aggs": {
        "amount": {
          "value_count": {
            "field": "dateCreated"
          }
        }
      }
    },
    "avg_daily_order": {
      "avg_bucket": {
        "buckets_path": "orders_per_day>amount"
      }
    }
  }
}