Optimize SQL: Customers that haven't ordered for x days

558 views Asked by At

I have created this SQL in order to find customers that haven't ordered for X days.

It is returning a result set, so this post is mainly just to get a second opinion on it, and possible optimizations.

SELECT o.order_id,
       o.order_status,
       o.order_created,
       o.user_id,
       i.identity_firstname,
       i.identity_email,

  (SELECT COUNT(*)
   FROM orders o2
   WHERE o2.user_id=o.user_id
     AND o2.order_status=1) AS order_count,

  (SELECT o4.order_created
   FROM orders o4
   WHERE o4.user_id=o.user_id
     AND o4.order_status=1
   ORDER BY o4.order_created DESC LIMIT 1) AS last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
   AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
  AND s.subscriber_status=1
  AND s.subsriber_type=e
  AND s.subscription_id=1
WHERE DATE(o.order_created) = "2013-12-14"
  AND o.order_status=1
  AND o.user_id NOT IN
    (SELECT o3.user_id
     FROM orders o3
     WHERE o3.user_id=o.user_id
       AND o3.order_status=1
       AND DATE(o3.order_created) > "2013-12-14")

Can you guys find any potential problems with this SQL? Dates are dynamically inserted.

The final SQL that I put in production, will basically only include o.order_id, i.identity_id and o.order_count - this order_count will need to be correct. The other selected fields and 'last_order' subquery will not be included, it's only for testing.

This should give me a list of users that have their last order on that particular day, and is a newsletter subscriber. I am particular in doubt about correctness of the NOT IN part in the WHERE clause, and the order_count subquery.

2

There are 2 answers

5
Bohemian On BEST ANSWER

There are several problems:

A. Using functions on indexable columns

You are searching for orders by comparing DATE(order_created) with some constant. This is a terrible idea, because a) the DATE() function is executed for every row (CPU) and b) the database can't use an index on the column (assuming one existed)

B. Using WHERE ID NOT IN (...)

Using a NOT IN (...) is almost always a bad idea, because optimizers usually have trouble with this construct, and often get the plan wrong. You can almost always express it as an outer join with a WHERE condition that filters for misses using an IS NULL condition for a joined column (and adds the side benefit of not needing DISTINCT, because there's only ever one miss returned)

C. Leaving joins that filtering out of large portions of rows too late

The earlier you can mask off rows by not making joins the better. You can do this by joining less likely to match tables earlier in the joined table list, and by putting non-key conditions into join rather than the where clause to get the rows excluded as early as possible. Some optimizers to this anyway, but I've often found they don't

D. Avoid correlated subqueries like the plague!

You have several correlated subqueries - ones that are executed for every row of the main table. That's really an incredibly bad idea. Again sometimes the optimizer can craft them into a join, but why rely (hope) on that. Most correlated subqueries can be expressed as a join; you examples are no exception.

With the above in mind, there are some specific changes:

  • o2 and o4 are the same join, so o4 may be dispensed with entirely - just use o2 after conversion to a join
  • DATE(order_created) = "2013-12-14" should be written as order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"

This query should be what you want:

SELECT
    o.order_id,
    o.order_status,
    o.order_created,
    o.user_id,
    i.identity_firstname,
    i.identity_email,
    count(o2.user_id) AS order_count,
    max(o2.order_created) AS last_order
FROM orders o
LEFT JOIN orders o2 ON o2.user_id = o.user_id AND o2.order_status=1
LEFT JOIN orders o3 ON o3.user_id = o.user_id 
    AND o3.order_status=1
    AND o3.order_created >= "2013-12-15 00:00:00"
JOIN user_identities ui ON o.user_id=ui.user_id
JOIN identities i ON ui.identity_id=i.identity_id AND i.identity_email != ''
JOIN subscribers s ON i.identity_id=s.identity_id
  AND s.subscriber_status=1
  AND s.subsriber_type=e
  AND s.subscription_id=1
WHERE o.order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
AND o.order_status=1
AND o3.order_created IS NULL -- This gets only missed joins on o3
GROUP BY
    o.order_id,
    o.order_status,
    o.order_created,
    o.user_id,
    i.identity_firstname,
    i.identity_email;

The last line is how you achieve the same as NOT IN (...) using a LEFT JOIN

Disclaimer: Not tested.

1
Kickstart On

Can't really comment on the results as you have not posted any table declares or example data, but your query has 3 correlated sub queries which is likely to make it perform poorly (OK, one of those is for last_order and is only for testing).

Eliminating the correlated sub queries and replacing them with joins would give something like this:-

SELECT o.order_id,
        o.order_status,
        o.order_created,
        o.user_id,
        i.identity_firstname,
        i.identity_email,
        Sub1.order_count,
        Sub2.last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
   AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
  AND s.subscriber_status=1
  AND s.subsriber_type=e
  AND s.subscription_id=1
LEFT OUTER JOIN
(
    SELECT user_id, COUNT(*) AS order_count
    FROM orders 
    WHERE order_status=1
    GROUP BY user_id
) Sub1
ON o.user_id = Sub1.user_id
LEFT OUTER JOIN
(
    SELECT user_id, MAX(order_created) as last_order
    FROM orders 
    WHERE order_status=1
    GROUP BY user_id
) AS Sub2
ON o.user_id = Sub2.user_id
LEFT OUTER JOIN
(
    SELECT DISTINCT user_id
    FROM orders 
    WHERE order_status=1
    AND DATE(order_created) > "2013-12-14"
) Sub3
ON o.user_id = Sub3.user_id
WHERE DATE(o.order_created) = "2013-12-14"
  AND o.order_status=1
  AND Sub3.user_id IS NULL