I have a Greenplum database, in which I have 10 segments reflecting the 10 hard disks. My table is divided into primary partition based on date and secondary partition based on hash id. So for a month there will be 30 primary partition and each partition containing 100 sub partition. And the data in the sub-partition is loaded based on hashid. Now the question is how these partitions are distributed among the segments.
Guess 1:
seg1(equally distributed based on pri partition : 30datepartition/10=3 date partitions)
date1---0-99 sub partition of hashid
date2---0-99 sub partition of hashid
date3---0-99 sub partition of hashid
seg2(equally contains 30/10=3 date partitions)
date4---0-99 partition of hashid
date5---0-99 partition of hashid
date6---0-99 partition of hashid
...
..
seg10
date27---0-99 partition of hashid
date28---0-99 partition of hashid
date29---0-99 partition of hashid
OR
Guess 2
seg1(distributed by 100hashid/10=10 hashid partitions)
date1---0-9 partition of hashid
date2---0-9 partition of hashid
...
date30---0-9 partition of hashid
seg2(equally contains 100hashid/10=10 hashid partitions)
date1---10-19 partition of hashid
date2---10-19 partition of hashid
...
date30---10-19 partition of hashid
How does this work? Guess 1 or 2, if both are wrong kindly en light me on the way it is distributed on segment level.
Is it a good design to sub-partition it based on hash id. As I am dealing with 6 million records every day, and i have to store date for a year, i wanted the search to hit on very less portion of data. In other words based on the key query i will determine and hashid range and it will search in those specific partitions.
Thanks Ganesh.R
When you create a table, and distribution key is any key say event_id and distribution is done on the basis of any date column for example event_date, Best way is partition by column should be the part of distribution key to properly distribute the data / for skewness,
Thanks