Find supersets of association database table

87 views Asked by At

I need to find supersets within an association table.

You can consider bag to be a collection of item records. The relationship between the two is tracked in bag_item.

I need to know which bags are supersets of other bags: containing ALL the items and possibly one or more additional items.

bag

id name
1 Bag A
2 Bag B
3 Bag C
4 Bag D

item

id name
1 Pencil
2 Pen
3 Eraser
4 Marker

bag_item (association table)

id bag_id item_id
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 3 4
9 4 1
10 4 2
11 4 3

Database: MySQL 8

I have "base" bag identifiers and I want to get "superset" bag identifiers that contain ALL the items in my "base" bag (and more items if present).

If I query for ALL bags, the response should look something like this:

base_bag_id superset_bag_id
1 2
1 4
3 2
4 2

But, I also want to be able to filter on the basis of the base_bag_id, e.g. 1 & 4.

I've tried something to the effect of the following:

select distinct base_bag_item.bag_id, superset_bag_item.bag_id
from bag_item as base_bag_item
join bag_item as superset_bag_item on superset_bag_item.item_id = base_bag_item.item_id
where superset_bag_item.bag_id != base_bag_item.bag_id
and base_bag_item.bag_id in (1, 4)
order by base_bag_item.bag_id, superset_bag_item.bag_id

However, this query will incorrectly produce bags that contain at least one of the quotes from the "base", e.g. base_bag_id 1 will have superset_bag_id 3 because both bags contain item_id 1.

How do I get my desired output?

3

There are 3 answers

6
Charlieface On BEST ANSWER

This is classic Relational Division With Remainder, with the multiple divisors tweak (you want to get all bags for all other possible bags, not just for one bag).

There are a number of solutions.


A simple one, if rather inefficient, is a double NOT EXISTS.

SELECT
  base.id AS base_bag_id,
  s.id AS superset_bag_id
FROM bag base
JOIN bag s
   ON s.id <> base.id
  AND NOT EXiSTS (SELECT 1
    FROM bag_item bi
    WHERE bi.bag_id = base.id
      AND NOT EXISTS (SELECT 1
        FROM bag_item si
        WHERE si.item_id = bi.item_id
          AND si.bag_id = s.id
    )
);

A more efficient solution is to join everything up using a left-join, then group and check the count for missing values.

SELECT
  base.bag_id AS base_bag_id,
  s.id AS superset_bag_id
FROM bag_item base
JOIN bag s ON s.id <> base.bag_id
LEFT JOIN bag_item si
   ON si.bag_id = s.id
  AND si.item_id = base.item_id
GROUP BY
  base.bag_id,
  s.id
HAVING COUNT(si.item_id) = COUNT(*);

However, performance is still poor with a large number of bags and items. This can be significantly improved by limiting the superset bag candidates to those that contain at least one of the items in the base bag.

SELECT
  base.bag_id AS base_bag_id,
  s.id AS superset_bag_id
FROM bag_item base
JOIN bag s
  ON s.id <> base.bag_id
  AND s.id IN (
    SELECT DISTINCT overlap_bag_item.bag_id
    FROM bag_item AS base_bag_item_mirror
    JOIN bag_item AS overlap_bag_item
      ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
    WHERE base_bag_item_mirror.bag_id = base.bag_id)
LEFT JOIN bag_item si
  ON si.bag_id = s.id
  AND si.item_id = base.item_id
GROUP BY
  base.bag_id,
  s.id
HAVING COUNT(si.item_id) = COUNT(*);

This can then be modified to filter out the base bags that we're interested in, e.g. base bag id of 1 or 4.

SELECT
  base.bag_id AS base_bag_id,
  s.id AS superset_bag_id
FROM bag_item base
JOIN bag s
  ON s.id <> base.bag_id
  AND s.id IN (
    SELECT /*+ NO_SEMIJOIN(MATERIALIZATION) */ DISTINCT overlap_bag_item.bag_id
    FROM bag_item AS base_bag_item_mirror
    JOIN bag_item AS overlap_bag_item
      ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
    WHERE base_bag_item_mirror.bag_id = base.bag_id)
LEFT JOIN bag_item si
  ON si.bag_id = s.id
  AND si.item_id = base.item_id
WHERE base.bag_id IN (1, 4)
GROUP BY
  base.bag_id,
  s.id
HAVING COUNT(si.item_id) = COUNT(*);

This solution selectively turns off semijoin optimization in MySQL for the superset filtering subquery. When semijoin optimization is enabled, adding the WHERE base.bag_id IN (1, 4) statement results in the creation of a temporary table with all the bag_items in it. This defeats the purpose of the subquery and tanks performance. We avoid this with an optimizer hint.

If you switch to WHERE (base.bag_id = 1 OR base.bag_id = 4), semijoin optimization isn't performed on the superset subquery and the hint isn't needed.

When making these kinds of optimization decisions, always check EXPLAIN first.


If necessary, we can also remove the reference to bag by using a window function over the first bag_item.

SELECT
  base.bag_id AS base_bag_id,
  si.bag_id AS superset_bag_id
FROM (
    SELECT *,
      COUNT(*) OVER (PARTITION BY base.bag_id) AS count
    FROM bag_item base
) base
LEFT JOIN bag_item si
   ON si.item_id = base.item_id
  AND si.bag_id <> base.bag_id
GROUP BY
  base.bag_id,
  si.bag_id
HAVING COUNT(*) = MIN(base.count);

db<>fiddle

2
YUSOF KHAN On

Try this solution please:

SELECT 
    base_bag.bag_id AS base_bag_id, 
    superset_bag.bag_id AS superset_bag_id
FROM 
    (SELECT bag_id FROM bag_item GROUP BY bag_id) base_bag
JOIN 
    (SELECT bag_id FROM bag_item GROUP BY bag_id) superset_bag
ON base_bag.bag_id <> superset_bag.bag_id
WHERE NOT EXISTS (
    SELECT item_id 
    FROM bag_item base_items 
    WHERE base_items.bag_id = base_bag.bag_id 
    AND NOT EXISTS (
        SELECT 1
        FROM bag_item superset_items
        WHERE superset_items.bag_id = superset_bag.bag_id 
        AND superset_items.item_id = base_items.item_id
    )
)
AND (
    SELECT COUNT(DISTINCT item_id) FROM bag_item WHERE bag_id = superset_bag.bag_id
) > (
    SELECT COUNT(DISTINCT item_id) FROM bag_item WHERE bag_id = base_bag.bag_id
)
ORDER BY base_bag_id, superset_bag_id;
1
Akina On

But, I also want to be able to filter on the basis of the base_bag_id, e.g. 1 & 4.

Assuming that this means "find a bug which contains all items for each specified bag":

SET @base_bags = '1,4';
WITH 
minimal_superset AS (
  SELECT DISTINCT item_id
  FROM bag_item
  WHERE FIND_IN_SET(bag_id, @base_bags)
  ),
bags_list AS (
  SELECT DISTINCT bag_id
  FROM bag_item
  )
SELECT @base_bags base_bags_list, bag_id superset_bag_id
FROM minimal_superset
CROSS JOIN bags_list
LEFT JOIN bag_item USING (bag_id, item_id)
GROUP BY bag_id
HAVING NOT SUM(bag_item.item_id IS NULL)
base_bags_list superset_bag_id
1,4 2
1,4 4

fiddle

If the bags in the list must be not included into the superset bags list (bag #4 in the example) then add WHERE NOT FIND_IN_SET(bag_id, @base_bags) to CTE bags_list.