Efficiently selecting rows where a row in a related table exists

85 views Asked by At

I've got a recurring pattern in a system that I'm currently working on, where, for example, I need to select all the users that have orders under a list of possible companies. Or needing to select users if a record of this user being flagged exists.

My users table contains 430,825 records, so this shouldn't really be that difficult to deal with. Right now I'm close, I have a query that get's that .047s execution time that I look for, but if I add one more piece to this, it gets real slow.

Here's my current query, the fast one:

select`UserID`
from`users`
where(`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
    or`UserID`in(select*
        from(select`UserID`
            from`invoices`
            where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
            and`__Active`=1)`a`)
    or`UserID`in(select*
        from(select`UserID`
            from`quoterequests`
            where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
            and`__Active`=1)`a`))
and(`UserID`in(select*
        from(select`UserID`
            from`userassociations`
            where`_Email`='[email protected]'
            and`__Active`=1)`a`))
and(`UserID`in(select*
        from(select`UserID`
            from`usercustomerflags`
            where`CustomerFlagID`in(10,27,17,1,2,3,4,5,6)
            and`__Active`=1)`a`)
    or not exists(select 1 
        from`usercustomerflags`
        where`__Active`=1 
        and`users`.`UserID`=`UserID`))
and`Deleted`=0 
order by`DateTimeAdded`desc 
limit 50;

(The extra select*from(...) is because of this https://stackoverflow.com/a/1434712/728236)

In the middle there, I'm pulling users by further by email address, while checking other related tables for emails that might be related to this user. Like, the next piece searches users including their CC addresses when quotes are sent out to customers.

select`UserID`
from`users`
where(`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
    or`UserID`in(select*
        from(select`UserID`
            from`invoices`
            where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
            and`__Active`=1)`a`)
    or`UserID`in(select*
        from(select`UserID`
            from`quoterequests`
            where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
            and`__Active`=1)`a`))
and(`UserID`in(select*
        from(select`UserID`
            from`userassociations`
            where`_Email`='[email protected]'
            and`__Active`=1)`a`)
    or`UserID`in(select*
        from(select`UserID`
            from`userquotesemails`
            where`Email`='[email protected]'
            and`__Active`=1)`a`))
and(`UserID`in(select*
        from(select`UserID`
            from`usercustomerflags`
            where`CustomerFlagID`in(10,27,17,1,2,3,4,5,6)
            and`__Active`=1)`a`)
    or not exists(select 1 
        from`usercustomerflags`
        where`__Active`=1 
        and`users`.`UserID`=`UserID`))
and`Deleted`=0 
order by`DateTimeAdded`desc 
limit 50;

I've added the alternate table to search for emails, but now the query takes 3.016s, which is way, way slower. It seems odd that, as I was building this query, that last part seemed to be the tipping point in performance here, what would be the cause of this?

The first and second explains, respectively

+----+--------------------+-------------------+--+----------------+---------------------------------------------------------------------------------------------+------------------------------+------+-----------------------+---+-------+---------------------------------+
|  1 | PRIMARY            | <subquery6>       |  | ALL            |                                                                                             |                              |      |                       |   | 0.00  | Using temporary; Using filesort |
|  1 | PRIMARY            | users             |  | eq_ref         | PRIMARY,UserID_UNIQUE,fk_users_1_idx,users_Customers                                        | PRIMARY                      |  144 | <subquery6>.UserID    | 1 | 50.00 | Using where                     |
|  6 | MATERIALIZED       | userassociations  |  | ref            | userassociations_UserID,userassociations__Email                                             | userassociations__Email      | 1026 | const                 | 3 | 10.00 | Using where                     |
| 10 | DEPENDENT SUBQUERY | usercustomerflags |  | ref            | usercustomerflags_UserID_idx                                                                | usercustomerflags_UserID_idx |  144 | sterling.users.UserID | 1 | 10.00 | Using where                     |
|  8 | DEPENDENT SUBQUERY | usercustomerflags |  | index_subquery | usercustomerflags_CustomerFlagID_idx,usercustomerflags_UserID_idx                           | usercustomerflags_UserID_idx |  144 | func                  | 1 | 4.95  | Using where                     |
|  4 | DEPENDENT SUBQUERY | quoterequests     |  | index_subquery | quoterequests_CompanyID,quoterequests_UserID,quoterequests__Latest,quoterequests_UserQuotes | quoterequests__Latest        |  145 | func                  | 2 | 5.00  | Using where                     |
|  2 | DEPENDENT SUBQUERY | invoices          |  | index_subquery | Invoice_UserID_idx,Invoice_CompanyID_idx,invoices_SampleRequests,invoices_LateOrdersBubble  | Invoice_UserID_idx           |  145 | func                  | 1 | 3.33  | Using where                     |
+----+--------------------+-------------------+--+----------------+---------------------------------------------------------------------------------------------+------------------------------+------+-----------------------+---+-------+---------------------------------+

+----+--------------------+-------------------+--+-----+---------------------------------------------------------------------------------------------+--------------------------------+------+-----------------------+--------+--------+-------------+
|  1 | PRIMARY            | users             |  | ref | fk_users_1_idx,users_Customers                                                              | users_Customers                |    4 | const                 | 227515 | 100.00 | Using where |
| 12 | DEPENDENT SUBQUERY | usercustomerflags |  | ref | usercustomerflags_UserID_idx                                                                | usercustomerflags_UserID_idx   |  144 | sterling.users.UserID |      1 | 10.00  | Using where |
| 10 | SUBQUERY           | usercustomerflags |  | ALL | usercustomerflags_CustomerFlagID_idx,usercustomerflags_UserID_idx                           |                                |      |                       |   3509 | 4.94   | Using where |
|  8 | SUBQUERY           | userquotesemails  |  | ref | userquotesemails_Email__Active,userquotesemails_UserID                                      | userquotesemails_Email__Active | 1027 | const,const           |      1 | 100.00 |             |
|  6 | SUBQUERY           | userassociations  |  | ref | userassociations_UserID,userassociations__Email                                             | userassociations__Email        | 1026 | const                 |      3 | 10.00  | Using where |
|  4 | SUBQUERY           | quoterequests     |  | ref | quoterequests_CompanyID,quoterequests_UserID,quoterequests__Latest,quoterequests_UserQuotes | quoterequests_CompanyID        |  144 | const                 |  16702 | 10.00  | Using where |
|  2 | SUBQUERY           | invoices          |  | ref | Invoice_UserID_idx,Invoice_CompanyID_idx,invoices_SampleRequests,invoices_LateOrdersBubble  | Invoice_CompanyID_idx          |  144 | const                 |  17678 | 10.00  | Using where |
+----+--------------------+-------------------+--+-----+---------------------------------------------------------------------------------------------+--------------------------------+------+-----------------------+--------+--------+-------------+

Also, I've tried using joins, e.g. joining the invoices table, etc. but then I get the issue of having duplicate user rows per each invoice or quoterequest the joins receive, and grouping/distinct & ordering the resulting data became immensely slow, in the minutes.

I've also tried just the "exists" version of the first query, as suggesting by the doc https://dev.mysql.com/doc/refman/5.7/en/subquery-optimization-with-exists.html like so

select`UserID`
from`users`
where(`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
    or exists(select 1 
        from`invoices`
        where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
        and`__Active`=1 
        and`users`.`UserID`=`UserID`)
    or exists(select 1 
        from`quoterequests`
        where`CompanyID`in('3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8')
        and`__Active`=1 
        and`users`.`UserID`=`UserID`))
and(exists(select 1 
        from`userassociations`
        where`_Email`='[email protected]'
        and`__Active`=1 
        and`users`.`UserID`=`UserID`))
and(exists(select 1 
        from`usercustomerflags`
        where`CustomerFlagID`in(10,27,17,1,2,3,4,5,6)
        and`__Active`=1 
        and`users`.`UserID`=`UserID`)
    or not exists(select 1 
        from`usercustomerflags`
        where`__Active`=1 
        and`users`.`UserID`=`UserID`))
and`Deleted`=0 
order by`DateTimeAdded`desc 
limit 50;

But that gets me up to 5.516s, so that definitely doesn't seem like the right direction to make.

What is the most efficient way to select data the way that I'm trying? Or do I need to restructure some of my tables to get the performance I'm looking for?


I've isolated the smallest sub-problem and bottle neck I think I have. Here's my lighter query

select`users`.`UserID`,`users`.`_Customer`
from`users`
left join`userassociations`on`userassociations`.`UserID`=`users`.`UserID`
and`userassociations`.`__Active`=1 
where(`users`.`Email`='[email protected]'
    or`userassociations`.`_Email`='[email protected]')
and`users`.`Deleted`=0 
order by`users`.`DateTimeAdded`desc 
limit 50;

And the explain

+---+--------+------------------+--+-----+--------------------------------------------------------+-------------------------+-----+-----------------------+--------+--------+-------------+
| 1 | SIMPLE | users            |  | ref | users_getemail_INDEX,unify_email_INDEX,users_Customers | users_Customers         |   4 | const                 | 221463 | 100.00 | Using where |
| 1 | SIMPLE | userassociations |  | ref | userassociations_UserID                                | userassociations_UserID | 144 | sterling.users.UserID |      1 | 100.00 | Using where |
+---+--------+------------------+--+-----+--------------------------------------------------------+-------------------------+-----+-----------------------+--------+--------+-------------+

This query takes about 1.5 seconds to execute


CREATE TABLE `users` (
  `UserID` char(36) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
    ...
  `Email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
    ...
  `DateTimeAdded` datetime DEFAULT NULL,
    ...
  `Deleted` int(1) NOT NULL DEFAULT '0',
    ...
  `_LatestInvoiceDateTimeAdded` datetime DEFAULT NULL,
  `_InvoiceCount` int(11) NOT NULL DEFAULT '0',
  `_Customer` varchar(512) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
    ...
  PRIMARY KEY (`UserID`),
  UNIQUE KEY `UserID_UNIQUE` (`UserID`),
    ...
  KEY `users_getemail_INDEX` (`Email`(191),`_InvoiceCount`,`_LatestInvoiceDateTimeAdded`,`DateTimeAdded`),
  KEY `unify_email_INDEX` (`Email`(191),`UserID`),
    ...
  KEY `users_Customers` (`Deleted`,`DateTimeAdded`),
    ...
  KEY `users_DateTimeAdded` (`DateTimeAdded`,`UserID`),
  FULLTEXT KEY `users_FULLTEXT__Customer` (`_Customer`),
    ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;


CREATE TABLE `userassociations` (
   `UserAssociationID` binary(16) NOT NULL,
   `UserID` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
   `AssociatedUserID` char(36) COLLATE utf8mb4_unicode_ci NOT NULL,
   `_Email` varchar(256) COLLATE utf8mb4_unicode_ci NOT NULL,
   `__UserID` char(36) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
   `__Active` tinyint(1) NOT NULL DEFAULT '1',
   `__Added` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
   `__Updated` timestamp(6) NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP(6),
   PRIMARY KEY (`UserAssociationID`),
   KEY `userassociations_UserID` (`UserID`),
   KEY `userassociations_AssociatedUserID` (`AssociatedUserID`),
   KEY `userassociations___UserID` (`__UserID`),
   KEY `userassociations__Email` (`_Email`),
   CONSTRAINT `userassociations_AssociatedUserID` FOREIGN KEY (`AssociatedUserID`) REFERENCES `users` (`UserID`) ON DELETE NO ACTION ON UPDATE NO ACTION,
   CONSTRAINT `userassociations_UserID` FOREIGN KEY (`UserID`) REFERENCES `users` (`UserID`) ON DELETE NO ACTION ON UPDATE NO ACTION,
   CONSTRAINT `userassociations___UserID` FOREIGN KEY (`__UserID`) REFERENCES `users` (`UserID`) ON DELETE NO ACTION ON UPDATE NO ACTION
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Hmm... So it seemed like it was working, but I've found a pair of tables that it doesn't seem to be as efficient with, which is my users and invoices tables.

I've have these indexes:

users:    INDEX(`CompanyID`, `Deleted`, `DateTimeAdded`) 
invoices: INDEX(`UserID`, `__Active`) 
invoices: INDEX(`CompanyID`)
users:    INDEX(`UserID`, `Deleted`)

and the query

select`users`.`UserID`,`users`.`DateTimeAdded`
from`users`
join`invoices`on`invoices`.`UserID`=`users`.`UserID`
and`invoices`.`__Active`=1 
where`invoices`.`CompanyID`='3e55c8b4-d8b6-11e4-b38f-b8ca3a83b4c8'
and`users`.`Deleted`=0 
order by`DateTimeAdded`desc 
limit 200;

This query alone takes .3 seconds, which feels slow to me, like it isn't making the best use of the indexes, especially because users only has 430,997 rows and invoices only has 194,180, and this looks like it should be a pretty simple query.

EDIT: Actually it's much worse than that, if the CompanyID given only includes ~4 rows, then this query takes 3.5 seconds

+---+--------+----------+--+-----+------------------------------------------------+-----------------------+-----+--------------------------------+------+--------+----------------------------------------------+
| 1 | SIMPLE | invoices |  | ref | Invoice_CompanyID_idx,invoices_UserID___Active | Invoice_CompanyID_idx | 144 | const                          | 7750 | 10.00  | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | users    |  | ref | users_UserID_Deleted                           | users_UserID_Deleted  | 148 | sterling.invoices.UserID,const |    1 | 100.00 |                                              |
+---+--------+----------+--+-----+------------------------------------------------+-----------------------+-----+--------------------------------+------+--------+----------------------------------------------+
2

There are 2 answers

11
Rick James On BEST ANSWER

For that smaller problem:

( select u.`UserID`, u.`_Customer`, u.DateTimeAdded
    from  `users` AS u
    where  u.`Email` = '[email protected]'
      and  u.`Deleted` = 0
      AND EXISTS ( SELECT * FROM `userassociations`
                       WHERE UserId = u.UserID
                         AND __Active = 1 )
    order by  u.`DateTimeAdded` desc
    limit  50
)
UNION DISTINCT
( select u.`UserID`, u.`_Customer`, u.DateTimeAdded
    from  `users` AS u
    JOIN  `userassociations` AS ua
         ON  ua.`UserID` = u.`UserID`
        and  ua.`__Active` = 1
    where  ua.`_Email` = '[email protected]' 
      and  u.`Deleted`=0
    order by  u.`DateTimeAdded` desc
    limit  50
)
order by `DateTimeAdded` desc
limit  50

These will be needed:

u:  INDEX(Email, Deleted, DateTimeAdded)  -- date last
ua: INDEX(UserId, __Active)   -- either order
ua: INDEX(_Email)
u:  INDEX(UserID, Deleted)

(Let me know if you get syntax errors. If it is too slow, please provide EXPLAIN.)

Index prefixing (Email(191)) is usually useless. Get rid if it. Here are 5 ways to avoid it: http://mysql.rjweb.org/doc.php/limits#767_limit_in_innodb_indexes

A PK is a UNIQUE key, so get rid of the second of:

PRIMARY KEY (`UserID`),
UNIQUE KEY `UserID_UNIQUE` (`UserID`),

Smells like a UUID; use ascii (ascii_general_ci), not utf8mb4:

... char(36) COLLATE utf8mb4_unicode_ci

INT(1) takes 4 bytes; use TINYINT for flags.

7
digital.aaron On

Pretty sure you can replace all those subqueries in the WHERE clause with a combination of INNER and LEFT JOINs. Try this:

SELECT u.UserID 
FROM users u
INNER JOIN userassociations ua ON ua.userid = u.userid
LEFT JOIN usercustomerflags uc ON uc.userid = u.userid AND uc.__Active =1
LEFT JOIN invoices i ON i.userid = u.userid AND i.__Active =1
LEFT JOIN quoterequests q ON q.userid = u.userid AND q.__Active =1
WHERE ua._Email ='[email protected]'
AND ua.__Active =1
AND (uc.userid IS NULL 
        OR (uc.userid IS NOT NULL AND uc.CustomerFlagID IN (10,27,17,1,2,3,4,5,6))
    )
AND (u.CompanyID = '3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8'
        OR (i.CompanyID = '3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8' AND i.userid IS NOT null)
        OR (q.CompanyID = '3e55d1bb-d8b6-11e4-b38f-b8ca3a83b4c8' AND q.userid IS NOT null)
    )
AND u.Deleted =0 
ORDER BY u.DateTimeAdded desc 
LIMIT 50;