Star schema - splitting dimensions into multiple smaller dimensions


I have been working in dashboarding for a long time and have always built my reporting and dashboards on the traditional Kimball star schema. The architecture team at work has been working on automating the dimensional star schema, and their automation differs noticeably from what I am used to.

What they started doing: creating mini-dimensions based on each business combination of hub and link in the Data Vault (DV) warehouse. For instance, the client dimension is split into 5 based on attribute affinity and change frequency (dim_client_address, dim_client_physical_attributes, dim_client_sociodemo, dim_client_common, ...). The product dimension is split into 4 mini-dimensions, and so on for all dimensions. All these mini-dimensions use the same natural key (NK) but different surrogate keys (SK). All mini-dims have the same number of rows but different attributes.
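To make the layout concrete, here is a minimal sketch of what such a split might look like, using SQLite for portability. All table and column names here are hypothetical, not the actual warehouse objects:

```python
import sqlite3

# Hypothetical illustration: two client mini-dimensions share the
# natural key (client_nk) but each carries its own surrogate key
# and its own group of attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_client_common (
    client_common_sk INTEGER PRIMARY KEY,
    client_nk        TEXT NOT NULL,
    client_name      TEXT
);
CREATE TABLE dim_client_address (
    client_address_sk INTEGER PRIMARY KEY,
    client_nk         TEXT NOT NULL,
    city              TEXT
);
-- The fact table now needs one surrogate key per mini-dimension.
CREATE TABLE fact_sales (
    client_common_sk  INTEGER REFERENCES dim_client_common,
    client_address_sk INTEGER REFERENCES dim_client_address,
    amount            REAL
);
INSERT INTO dim_client_common  VALUES (1, 'C001', 'Alice');
INSERT INTO dim_client_address VALUES (7, 'C001', 'Paris');
INSERT INTO fact_sales         VALUES (1, 7, 100.0);
""")

# Both mini-dims describe the same set of clients, so row counts match.
n_common  = conn.execute("SELECT COUNT(*) FROM dim_client_common").fetchone()[0]
n_address = conn.execute("SELECT COUNT(*) FROM dim_client_address").fetchone()[0]
print(n_common == n_address)  # True
```

Note the consequence for the fact table: it carries one surrogate key column per mini-dimension rather than a single client key.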

I'm getting told that this:

  • is more performant, since I'm querying a subset of columns at a time (my queries are generated by reporting tools)
  • is more business-relevant, since the groupings are based on business concepts

This looks like a good idea as presented, but before going all in I want to make sure we don't have blind spots. For instance, to me this makes the model a little harder to query, and ease of querying is a basic principle of the dimensional star schema.

It seems to me that this is a form of centipede fact table, but I'm not sure, since I can't find examples on the web.

Is this something you have encountered?

Thank you! (Sorry if my English is not perfect; it's not my daily language.)


There are 3 answers

NickW

The article you've linked to conveniently stops before describing how a "star schema" built in this way would actually be queried.

The most "expensive" part of any SQL query is the joins. Imagine in your model if the Customer Dim has been split into 3 mini-Dims and you need to pick attributes in your query and they exist across those 3 mini-Dims; your query will now have to perform 3 joins. Multiply that by lots of "big" Dims and the number of joins in any query will explode and is highly likely to kill the performance of your queries.

Seems like someone has come up with a way of making it easier to build a "star schema" when the starting point is a Data Vault model, but hasn't actually thought it through to the end.

I would suggest going back to your architecture team and asking them to demonstrate that query performance with their approach will be acceptable, using realistic data volumes and real-world queries.

Rich

A good dimensional model aimed at reporting focuses on report query performance and understandability. That’s why the Kimball method was so successful: it made typical reporting queries faster, and easier for reporting users to navigate their way around.

More tables and more joins will likely make both of these things harder. It feels like this approach has more to do with automation and ease of producing the model than with building a database that is great for reporting.

Nick.Mc

An end user doesn't care whether a dimension is split into one or many tables, because they use a semantic model that glues it all together in a presentable way, possibly as folders or separate "tables" (whatever the tool calls them).

Since all the "sub" dimensions have the same number of rows, from a physical modelling perspective this is really just "vertical partitioning" of the dimension. There are plenty of arguments to be found both for and against vertical partitioning.
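If the physical split stays, one common way to hide a vertical partition from the reporting layer is a view that stitches the slices back together on the shared natural key, so queries still see a single dimension table. A minimal sketch with hypothetical names (note: with type-2 history you would also need effective-date logic in the join, which this ignores):

```python
import sqlite3

# Hypothetical vertically-partitioned client dimension, reassembled
# through a view joining on the shared natural key (client_nk).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_client_common  (sk INTEGER PRIMARY KEY, client_nk TEXT, name TEXT);
CREATE TABLE dim_client_address (sk INTEGER PRIMARY KEY, client_nk TEXT, city TEXT);
INSERT INTO dim_client_common  VALUES (1, 'C001', 'Alice');
INSERT INTO dim_client_address VALUES (5, 'C001', 'Paris');

-- The view presents one logical dimension over the two slices.
CREATE VIEW dim_client AS
SELECT c.client_nk, c.name, a.city
FROM dim_client_common c
JOIN dim_client_address a ON a.client_nk = c.client_nk;
""")

row = conn.execute("SELECT * FROM dim_client").fetchone()
print(row)  # ('C001', 'Alice', 'Paris')
```

The join cost doesn't disappear, of course; it just moves out of sight of the query author.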

Addressing your points:

  • is more performant, since I'm querying a subset of columns at a time (my queries are generated by reporting tools)

This is really a database performance question. The first question is: do you actually have a noticeable performance problem in your database because of wide dimensions? There are other, transparent options to help with this; for example, columnar storage (in SQL Server, a clustered columnstore index) lets queries read only the columns of interest instead of entire data pages.

  • is more business-relevant, since the groupings are based on business concepts

As per the comment above, unless your end users are writing SQL queries directly, this grouping belongs in your semantic model (i.e. the model your reporting tool presents), not necessarily your physical model.

I'm always wary of technology for technology's sake, and in my opinion DV can be one of the worst offenders. Does all the extra data engineering required to implement this actually provide any business benefit?