I have a DELETE query in Redshift that takes up to 40 seconds in production. The query is created programmatically and looks like:
EXPLAIN DELETE FROM platform.myTable WHERE id IN ('77258ef98a7a87a0.W01.7388.c930db5e66203047','77258ef98319adc4.W01.73ad.c930db5e66203047','77258ef97d8ff761.W01.73be.c930db5e66203047','77258ef985a5c1be.W01.738a.c930db5e66203047')
And the result of the EXPLAIN query shows:
XN Seq Scan on mytable (cost=0.00..23725.54 rows=5 width=6)
Filter: (((id)::text = '77258ef97d8ff761.W01.73be.c930db5e66203047'::text) OR ((id)::text = '77258ef98319adc4.W01.73ad.c930db5e66203047'::text) OR ((id)::text = '77258ef985a5c1be.W01.738a.c930db5e66203047'::text) OR ((id)::text = '77258ef98a7a87a0.W01.7388.c930db5e66203047'::text))
And my table schema looks like:
CREATE TABLE IF NOT EXISTS platform.myTable(
userId varchar(150) encode lzo,
eventTime timestamp encode delta32k,
id varchar(150) encode lzo,
typeId varchar(150) encode lzo,
...
)
distkey (typeId)
compound sortkey (eventTime, typeId, userId ... id);
Would adding a primary key on 'id' to the table make the DELETE query a point lookup instead of a scan-and-filter operation?
I suggest you first make sure that the DELETE query's elapsed time is actually coming from execution and not from some other part of the statement completing. EXPLAIN doesn't always give the full picture of what really happened. Take a look at STL_SCAN and see how long that scan is taking; for this query it should be the majority of the execution time. 40 seconds is a long time for a scan, so it will be telling whether this is or isn't the case.
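For example, something along these lines pulls the per-slice scan timing out of STL_SCAN (the query id 123456 is a placeholder; find the real one in STL_QUERY first):

-- Find the query id of the DELETE statement
select query, starttime, substring(querytxt, 1, 60)
from stl_query
where querytxt ilike '%delete from platform.mytable%'
order by starttime desc
limit 10;

-- Per-slice scan time and rows scanned for that query id (placeholder)
select slice, segment, step,
       datediff(ms, starttime, endtime) as scan_ms,
       rows, rows_pre_filter, is_rrscan, perm_table_name
from stl_scan
where query = 123456
order by scan_ms desc;

rows_pre_filter vs. rows will also show how much data is being read from disk just to be thrown away by the filter.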
If the scan is taking the majority of the time then heading down the route you are thinking of sounds right. A primary key may help the planner some (Redshift doesn't enforce the constraint or build an index from it), but I expect the biggest change you can make is to reorder the sort keys so that id comes first (or possibly second), as this will allow less data to be scanned from disk (the table must be analyzed). If the scan is the issue then scanning less should make things much faster.
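A minimal sketch of that change, assuming you keep the table in place (the column list here is illustrative since your DDL elides the full sort key, so substitute the real columns):

alter table platform.myTable alter compound sortkey (id, eventTime, typeId, userId);
analyze platform.myTable;

Note that Redshift has to resort the existing data after the ALTER, which can take a while on a large table, so plan that for a quiet period.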
If the scan isn't the big time consumer then finding what is will be key. Start by looking at STL_QUERY_METRICS to find where the time is spent in execution. But that isn't the only place time could be going - look at STL_WLM_QUERY and see if moving through the WLM queue is where the time is being lost. Lastly, you didn't specify how the time was measured, so it's possible that the COMMIT is part of the time being reported; if that's the case, look at STL_COMMIT_STATS to be sure that coherency management isn't bogged down.
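As a rough sketch of where to look (the query id and xid below are placeholders for your DELETE; both come from STL_QUERY):

-- Step-level execution metrics for the query
select step_type, segment, rows, cpu_time, blocks_read, run_time
from stl_query_metrics
where query = 123456;

-- Queue time vs. execution time (microseconds) in WLM
select query, service_class, total_queue_time, total_exec_time
from stl_wlm_query
where query = 123456;

-- Commit overhead for the DELETE's transaction
select node,
       datediff(ms, startqueue, startwork) as commit_queue_ms,
       datediff(ms, startwork, endtime)    as commit_work_ms
from stl_commit_stats
where xid = 987654
order by node;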
With all of these looks into the query's run time, activity on the cluster can be a big factor. WLM and COMMIT processing are highly sensitive to what else is happening, and even scanning of the table can be impacted if other queries are performing large data scans.