SQL Server choosing inefficient execution plan

878 views Asked by At

I've got a query that gets run in certain circumstances with an 'over-simplified' execution plan that actually turns out to be quite slow (3-5 seconds). The query is:

SELECT DISTINCT Salesperson.*
FROM Salesperson
    INNER JOIN SalesOrder on Salesperson.Id = SalesOrder.SalespersonId
    INNER JOIN PrelimOrder on SalesOrder.Id = PrelimOrder.OrderId
    INNER JOIN PrelimOrderStatus on PrelimOrder.CurrentStatusId = PrelimOrderStatus.Id
    INNER JOIN PrelimOrderStatusType on PrelimOrderStatus.StatusTypeId = PrelimOrderStatusType.Id
WHERE 
    PrelimOrderStatusType.StatusTypeCode = 'Draft'
    AND Salesperson.EndDate IS NULL

and the slow execution plan looks like:

Slow Execution Plan

The thing that stands out straight away is that the actual number of rows/executions is significantly higher than the respective estimates:

IX_SalesOrder_SalespersonId

UX_PrelimOrder_OrderId

UX_PrelimOrderStatus_Id

If I remove the Salesperson.EndDate IS NULL clause, then a faster, parallelized execution plan is run:

Fast Execution Plan

A similar execution plan also runs quite fast if I remove the DISTINCT keyword.

From what I can gather, it seems that the optimiser decides, based on its incorrect estimates, that the query won't be costly to run and therefore doesn't choose the parallelized plan. But I can't for the life of me figure out why it is choosing the incorrect plan. I have checked my statistics and they are all as they should be. I have tested in both SQL Server 2008 to 2016 with identical results.

2

There are 2 answers

1
deroby On

As @Gordon Linoff already said, DISTINCT usually is bad news for performance. Often it means you're amassing way too much data and then squeezing it back together in a more compact set. Better to keep it small all throughout the process, if possible.

Also, it's kind of counter-intuitive that the query plan with index scans turns out to be faster than the one with index seeks; it seems (in this case) parallelism makes up for it. You could try playing around with the Cost Threshold For Parallelism Option but beware that this is a server-wide setting! (then again, in my opinion the default of 5 is rather high for most use-cases I've run into personally; CPU's are aplenty these days, time still isn't =).

Bit of a long reach, but I was wondering if you could 'split' the query in 2, thus eliminating (a small) part of the guesswork of the server. I'm assuming here that StatusTypeCode is unique. (verify the datatype of the variable too!)

DECLARE @StatusTypeId int

SELECT @StatusTypeId = Id 
  FROM PrelimOrderStatusType 
 WHERE StatusTypeCode = 'Draft'

SELECT Salesperson.*
  FROM Salesperson
 WHERE Salesperson.EndDate IS NULL
   AND EXISTS ( SELECT * 
                  FROM SalesOrder 
                    ON SalesOrder.SalespersonId = Salesperson.Id
                  JOIN PrelimOrder 
                    ON PrelimOrder.OrderId = SalesOrder.Id 
                  JOIN PrelimOrderStatus 
                    ON PrelimOrderStatus.Id = PrelimOrder.CurrentStatusId
                   AND PrelimOrderStatus.StatusTypeId = @StatusTypeId)

If it doesn't help, could you give give the definition of the indexes that are being used?

1
Gordon Linoff On

SELECT DISTINCT is expensive. So, it is best to avoid it. Something like this:

SELECT sp.*
FROM Salesperson sp
WHERE EXISTS (SELECT 1
              FROM SalesOrder so INNER JOIN
                   PrelimOrder po 
                   ON so.Id = po.OrderId INNER JOIN
                   PrelimOrderStatus pos
                   ON po.CurrentStatusId = pos.Id INNER JOIN
                   PrelimOrderStatusType post
                   ON pos.StatusTypeId = post.Id
              WHERE sp.Id = so.SalespersonId AND
                    post.StatusTypeCode = 'Draft'
             ) AND
      sp.EndDate IS NULL;

Note: An index on SalesPerson(EndDate, Id) would be helpful.