I am performing some tests on vectorwise using a star schema and a single denormalized table. In the denormalized table I have 5 extra columns which are > varchar (100). I am new to the column oriented world and hence want to understand to get my facts right.
Few facts :
Query memory = 20G, Cache Memory is set to 7G
The size of name columns alone is around 6G, 6.5G, 2.6G and 650M. Now obviously these can't fit in my cache memory so system is I/O bound.
If I use a star schema approach it is working faster as then it has to cache integer columns from fact all of which can easily fit into my main memory cache.
I have a few questions surrounding this:
If I have to use a single table I need to increase my RAM to make sure all these columns to fit into memory ? I would like to know if someone else has done this kind of structure using Vertica or any other column oriented databases. If yes are you using the same approach all varchar columns should fit into RAM ?
The simple star schema query
select col1,col2, col3, sum(col4), sum(col5) from fact_table join tabl1 on condition join tab2 on condition where tab1.col1 = 1234 and tab2.col2 = 6789
Works slower than:
select col1,col2,col3,sumcol4, sumcol5 from (select col1,col2, col3, sum(col4) as sumcol4, sum(col5) as sumcol5 from fact_table where fact_table.col1 = 1234 and fact_table.col2 = 6789) as facts join tabl1 on condition join tab2 on condition
Why?
Note : Master Tables tab1 and tab2 are main memory resident.
I am using Vectorwise but just want to know if these observations are true for Vertica as well?
Vertica does not have the idea of keeping tables resident in memory. In some cases and in some kinds of joins, table data might already be in memory or get put there. But ultimately, Vertiac performance is dependent on how you store things on disk.
I would not expect the performance of thoe two queries in Vectorwise to have any bearing on how similar queries and tables would perform in Vertica. That said, the fewer number of rows involved with the join in the 2nd query suggests it would do better in various query execution engines.