Collapsing Rarely Changing Diminsions to a Fact Column

312 views Asked by At

In star-schema design, dimension tables for fact table(s) are almost a must. In many business cases, the modeler can assume with acceptable confidence that certain dimension values will not change without control. For example the Gender is sometimes used as a dimension where in fact it could as well be a column in the fact table.

I am not sure if people always think about this. It is easier to just add a dimension and not think about it. However, as one of the Kimball rules, there should not bee too many dimensions for a fact table (I think the number he suggested is about 20).

I took gender as an example, but there are many other such as state name, city name, credit card type, etc.

My Question is:

What rule should one use to decide whether to embed/Collapse the value in the fact table vs. having a separate dimension for it? Some of the possible answers are: 1. When it does not change (e.g. gender). 2. When it has few possible values and have short length?

What else?

EDIT

Even though I considered the question answered, I still went to further research. There is a case when you have may have to use a dimension. The case is here: "Fact dimensions are frequently used to support drill through actions because the drill through action in SQL Server Analysis Services (SSAS) requires that you select the attributes from a dimension. So if you users want to see certain fields when they do a drill through, you must have those fields in a dimension."

The above was quoted from here Degenerate Dimensions

I think the subject needs further analysis for the interested person(s).

1

There are 1 answers

9
Jo Douglass On BEST ANSWER

It sounds like you're describing what Kimball calls "degenerate dimensions" - where you store the dimension value directly in the fact table. Kimball's own take on these is that you use them when you just have the one column for that dimension. It's often something that's very low cardinality with the fact. A good example is something like purchase order - you would have a purchase order fact table, and then have a column called PurchaseOrderReference which is actually a degenerate dimension, rather than having a purchase order dimension that's one-to-one with the fact.

However, the reason to keep the number of dimensions fairly low is to stop the fact table from becoming too wide - and if you start putting your dimension columns into your facts, they will become much wider, and performance will suffer. For instance, if you're adding large string columns or certain number formats, these could be bigger than the dimension key columns - and if you are adding several columns where you could get away with a single dimension key, the impact is much greater.

The solution for this is the denormalisation of dimensions. You mention state and city - these could well go into one location dimension together, then you just have one dimension key from your fact to the location dimension, instead of two (one to city and one to state). Your example of gender could potentially be a candidate for a degenerate dimension, but only if gender is the only attribute about people that you store. If you have a person dimension, or even have several attributes about people (gender, height, hair colour for instance), these would be better off in a dimension together.

You are correct that a lot of people don't design this way - I think this is because many people are used to more normalised databases where separate city and state tables would be normal (and correct), and so they incorrectly create many dimensions with few attributes where a few wider dimensions would be better.