I need to find supersets within an association table.
You can consider bag to be a collection of item records. The relationship between the two is tracked in bag_item.
I need to know which bags are supersets of other bags: containing ALL the items and possibly one or more additional items.
bag
| id | name |
|---|---|
| 1 | Bag A |
| 2 | Bag B |
| 3 | Bag C |
| 4 | Bag D |
item
| id | name |
|---|---|
| 1 | Pencil |
| 2 | Pen |
| 3 | Eraser |
| 4 | Marker |
bag_item (association table)
| id | bag_id | item_id |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 3 |
| 6 | 2 | 4 |
| 7 | 3 | 1 |
| 8 | 3 | 4 |
| 9 | 4 | 1 |
| 10 | 4 | 2 |
| 11 | 4 | 3 |
Database: MySQL 8
I have "base" bag identifiers and I want to get "superset" bag identifiers that contain ALL the items in my "base" bag (and more items if present).
If I query for ALL bags, the response should look something like this:
| base_bag_id | superset_bag_id |
|---|---|
| 1 | 2 |
| 1 | 4 |
| 3 | 2 |
| 4 | 2 |
But, I also want to be able to filter on the basis of the base_bag_id, e.g. 1 & 4.
I've tried something to the effect of the following:
select distinct base_bag_item.bag_id, superset_bag_item.bag_id
from bag_item as base_bag_item
join bag_item as superset_bag_item on superset_bag_item.item_id = base_bag_item.item_id
where superset_bag_item.bag_id != base_bag_item.bag_id
and base_bag_item.bag_id in (1, 4)
order by base_bag_item.bag_id, superset_bag_item.bag_id
However, this query will incorrectly produce bags that contain at least one of the quotes from the "base", e.g. base_bag_id 1 will have superset_bag_id 3 because both bags contain item_id 1.
How do I get my desired output?
This is classic Relational Division With Remainder, with the multiple divisors tweak (you want to get all bags for all other possible bags, not just for one bag).
There are a number of solutions.
A simple one, if rather inefficient, is a double
NOT EXISTS.A more efficient solution is to join everything up using a left-join, then group and check the count for missing values.
However, performance is still poor with a large number of bags and items. This can be significantly improved by limiting the superset bag candidates to those that contain at least one of the items in the base bag.
This can then be modified to filter out the base bags that we're interested in, e.g. base bag
idof 1 or 4.This solution selectively turns off semijoin optimization in MySQL for the superset filtering subquery. When semijoin optimization is enabled, adding the
WHERE base.bag_id IN (1, 4)statement results in the creation of a temporary table with all thebag_items in it. This defeats the purpose of the subquery and tanks performance. We avoid this with an optimizer hint.If you switch to
WHERE (base.bag_id = 1 OR base.bag_id = 4), semijoin optimization isn't performed on the superset subquery and the hint isn't needed.When making these kinds of optimization decisions, always check
EXPLAINfirst.If necessary, we can also remove the reference to
bagby using a window function over the firstbag_item.db<>fiddle