Set the right partitions for Crate Database

264 views Asked by At

I am modelling for the Database CrateDB.
I have an avg. of 400 customers and the produce different amounts of time-series data every day. (Between 5K and 500K; avg. ~15K)

Later I should be able to query per customer_year_month and per customer_year_calendar_week. That means that I will only query for the intervals:

  • week
  • and month

Now I'am asking myself how to partition this table?

I would partion per customer and year.
Does this make sense?
Or would it be better to partion by customer, year and month?

1

There are 1 answers

0
claus On BEST ANSWER

so the question of partitioning a table is quite complex and should consider a lot of things. Among others:

  • What queries should be run?
  • The way the data is inserted
  • Available hardware resources
  • Cluster size

Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot. So in your case 15k inserts a day is not too much, however the distribution of inserts might cause problems, a customer's partition that grows with 500k inserts a day will run into performance problems earlier than the 5k person. As a consequence I would use weekly partitioning only.

create table "customer-logging" (
    customer_id long, 
    log string, 
    ts timestamp, 
    week as date_trunc('week', ts)
) partitioned by (week) into 8 shards

Please only use 8 shards if you have an appropriate amount of CPU cores ;)

Docs: date_trunc(), partitioned tables

Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)

Cheers, Claus