I have been working in dashboarding for a long time and have always used the traditional Kimball star schema to build my reporting and dashboarding on. The architecture team at work has been working on automating the dimensional star schema, and the automation brings a noticeable difference from what I am used to.
What they started doing: creating mini-dimensions based on each business combination of hub and link of the DV warehouse. For instance, the client dimension is split into 5 based on attribute affinity and change frequency (dim_client_address, dim_client_physical_attributes, dim_client_sociodemo, dim_client_common, ...). The product dimension is split into 4 mini-dimensions, and so on for all dimensions. All these mini-dimensions use the same natural key (NK) but different surrogate keys (SK), and they all have the same number of rows but different attributes.
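To make the structure concrete, here is a rough sketch of what I understand they are building (the table and column names are just examples I made up):

```sql
-- Hypothetical illustration of the mini-dimension split (names invented).
-- Each mini-dim carries the same natural key but its own surrogate key.
CREATE TABLE dim_client_address (
    client_address_sk   BIGINT PRIMARY KEY,  -- surrogate key, unique per mini-dim
    client_nk           VARCHAR(20),         -- natural key, shared across all client mini-dims
    street              VARCHAR(100),
    city                VARCHAR(50),
    postal_code         VARCHAR(10)
);

CREATE TABLE dim_client_sociodemo (
    client_sociodemo_sk BIGINT PRIMARY KEY,
    client_nk           VARCHAR(20),         -- same NK, different SK
    age_band            VARCHAR(10),
    income_band         VARCHAR(10)
);

-- The fact table then carries one foreign key per mini-dim:
CREATE TABLE fact_sales (
    client_address_sk   BIGINT,
    client_sociodemo_sk BIGINT,
    -- ... plus the keys of the other client and product mini-dims ...
    amount              DECIMAL(12,2)
);
```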
I'm being told that this:
- is more performant, since I'm querying a subset of columns at a time (my queries are generated by reporting tools)
- is more business-relevant, since the groupings are based on business concepts
This looks like a good idea as presented to me, but before going all in I want to make sure we do not have some blind spots. For instance, to me this makes the model a little less easy to query, and ease of querying is a basic principle of the dimensional star schema.
It seems to me that it is a form of centipede fact table, but I'm not sure, since I can't find examples on the web.
Is this something you have encountered?
Thank you! (Sorry if the English is not perfect; it's not my daily language.)
The article you've linked to conveniently stops before describing how a "star schema" built in this way would actually be queried.
The most "expensive" part of any SQL query is the joins. Imagine in your model if the Customer Dim has been split into 3 mini-Dims and you need to pick attributes in your query and they exist across those 3 mini-Dims; your query will now have to perform 3 joins. Multiply that by lots of "big" Dims and the number of joins in any query will explode and is highly likely to kill the performance of your queries.
Seems like someone has come up with a way of making it easier to build a "star schema" when the starting point is a Data Vault model, but hasn't actually thought it through to the end.
I would suggest going back to your architecture team and getting them to demonstrate that query performance using their approach will be acceptable, using realistic data volumes and real-world queries.
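For instance, something as simple as comparing the plans and timings of the same business question against both models at production data volumes (PostgreSQL-style syntax below, purely illustrative; adjust for your warehouse platform):

```sql
-- Run the same business question against both models at production volume
-- and compare join strategy, row estimates, and total runtime.
EXPLAIN ANALYZE
SELECT c.client_name, a.city, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_client_common  c ON f.client_common_sk  = c.client_common_sk
JOIN dim_client_address a ON f.client_address_sk = a.client_address_sk
GROUP BY c.client_name, a.city;
```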