ClickHouse offers Bloom filter indexes:explore their functionality and implementation methods to see how these probabilistic data structures can accelerate queries and lessen computational burden
What is the Bloom filter index in ClickHouse
Implementation methods of leveraging Bloom Filters to optimize query performance and reduce computational load.
Let's start at the top first: every
MergeTreetable has a primary index, which is built from the primary key. The primary index consists of the values of the primary of the first row of each granule, which is every 8,192 rows. The table data is sorted on disk by the primary key (or the sort order if the two are different, but the primary key must be a subset of the sort order).If you want, you can define a secondary index on a
MergeTreetable, which is also referred to as a skipping index. What do they skip? They skip your existing granules. How? Well, it depends on how you define the skipping index. The are different types, likeset,minmaxandbloom_filter.For example, a
minmaxskipping index stores the minimum and maximum value of a specific column for each existing granule. More accurately, for theGRANULARITYof the skipping index - which is a multiple of the table's granule size. For example, a skipping index withGRANULARITY10 will contain the values from 10 consecutive granules, or 81,920 rows. So that is how many rows you could potentially skip using this index. Let's call this a block of rows.The
bloom_filterskipping index stores the values from each block in a Bloom filter data structure (similar to whatsetdoes exceptsetattempts to store every value of the column from each block, while a Bloom filter is a bitmap of 1's and 0's - but I assume you know that already).So that is where Bloom filters exist in ClickHouse. All the usual Bloom filter rules apply here: you can get false positives -but there is no harm there except for the unnecessary processing of rows. Your
WHEREclause will filter out any false positives.I hope that helps...