I have a table that's used as a queue:
create table items
(
    itemId int, -- PK, identity
    status int, -- Possible values: 0 = Pending, 1 = Processing, 2 = Processed
    createdAt datetime2,
    updatedAt datetime2,
    content text
)
Producers add records to the queue and consumers read them FIFO:
- Producers add records with Pending status and createdAt set to the current time
- Consumers select records in Pending status ordered by createdAt. When consuming, they mark the record as Processing and set updatedAt to the current time (using an update/select CTE, as sketched below)
- After processing, consumers mark records as Processed
- While processing, consumers may crash and therefore won't be able to mark the record as Processed
- When another consumer finds a record that has been stuck in Processing status for longer than x minutes (i.e. updatedAt < current_time - x and status = Processing), they pick it up and process it (let's assume the new consumer won't crash ;) )
- The table has about 1M records and grows by about 20k a day
- At any given time there will be < 100 Pending and Processing records
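For concreteness, here is a minimal sketch of the consume step described above, using the status codes from the schema (0 = Pending, 1 = Processing). The locking hints, the @staleMinutes variable, and sysutcdatetime() as the clock are illustrative assumptions, not part of the original setup:

declare @staleMinutes int = 5; -- illustrative value for "x minutes"

with nextItem as (
    select top (1) itemId, status, updatedAt
    from items with (rowlock, updlock, readpast)
    where status = 0 -- Pending
       or (status = 1 -- Processing, but stuck for longer than x minutes
           and updatedAt < dateadd(minute, -@staleMinutes, sysutcdatetime()))
    order by createdAt
)
update nextItem
set status = 1, -- claim: mark as Processing
    updatedAt = sysutcdatetime()
output inserted.itemId;

The readpast hint lets concurrent consumers skip rows that another consumer has already locked, and updlock prevents two consumers from claiming the same row.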
I have 2 questions:
- Given this scenario (especially the last 2 points), would an index on (status, createdAt) with updatedAt as an included column be a good index?
I tried that, and the plan shows the index is used and execution time is very fast (sub-second). However, I'm not quite sure this low-cardinality index (with leading column status) is a good index, because such indexes are generally considered bad. I'm wondering if it works in my case because of the highly uneven distribution of the possible values: <1% of records are Pending or Processing, which is what I'm querying on, and no queries select Processed ones.
- I added the included column (updatedAt) to support the filter status = Processing and updatedAt < current_time - x, but I'm not sure it's useful. Does the query planner take the included column into account, or does it only look at the key columns of the index (status, createdAt)?
Bonus points if you answer both questions ;)
One of the main purposes of an index is to reduce the number of rows being read from the table. A low-cardinality index is one whose key column takes on only a handful of distinct values. So, if a table has ten million rows and the column has ten values, then each value would match, on average, a million rows.
An index is not helpful for fetching a million rows from a ten-million-row table, because every (or just about every) data page will contain a matching row, so the index does nothing to reduce the number of data pages being read.
So, your use of an index is quite reasonable: you are using it to find the non-processed rows, and there are only a few of those.
Your index is much larger than necessary, because it carries information about processed rows. You might consider a filtered index.
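From what you describe, I think it would be something like this sketch (the index name is illustrative, and the status codes follow the schema above):

create index ix_items_queue
on items (status, createdAt)
include (updatedAt)
where status in (0, 1); -- index only Pending (0) and Processing (1) rows

Since there are fewer than 100 Pending and Processing rows at any given time, this index stays tiny no matter how large the table grows.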
Sometimes, in these situations, you want to use a clustered index on status. Basically, this allows the "processed" items to be grouped together, and if the data pages they are on are never accessed, then those data pages do not need to be loaded.
In your case, though, I am guessing that items are added sequentially, so only the more recent items are being processed. Older data pages will be filled with processed items and, because they are never referenced, may not even occupy space in memory.
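For reference, such a clustered index could look like the sketch below (hypothetical name; this assumes the table is not already clustered on the primary key, and as noted above it is probably unnecessary for this workload):

create clustered index cix_items_status
on items (status, createdAt);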
state. Basically, this allows the "processed" items to grouped together. And, if the data pages they are on are not accessed, then those data pages do not need to be loaded.In you case, though, I am guessing that items are added sequentially so only more recent items are being processed. Older data pages will be filled with processed items and -- because they are never referenced -- may not even occupy space in memory.