MySQL using IN/FIND_IN_SET to read multiple rows in sub query


I have two tables, locations and location_groups:

CREATE TABLE locations (
    location_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(63) UNIQUE NOT NULL
);

INSERT INTO locations (name)
  VALUES
('london'),
('bristol'),
('exeter');

CREATE TABLE location_groups (
    location_group_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    location_ids VARCHAR(255) NOT NULL,
    user_ids VARCHAR(255) NOT NULL,
    name VARCHAR(63) NOT NULL
);

INSERT INTO location_groups (location_ids, user_ids, name)
  VALUES
('1', '1,2,4', 'south east'),
('2,3', '2', 'south west');

What I am trying to do is return all location_ids for all of the location_groups where the given user_id exists. I'm using CSV to store the location_ids and user_ids in the location_groups table. I know this isn't normalised, but this is how the database is and it's out of my control.

My current query is:

SELECT location_id 
  FROM locations 
  WHERE FIND_IN_SET(location_id, 
      (SELECT location_ids 
         FROM location_groups 
         WHERE FIND_IN_SET(2,location_groups.user_ids)) )

Now this works fine if the user_id is 1, for example (as only one location_groups row is returned), but if I search for user_id = 2, I get an error saying the subquery returns more than 1 row, which is expected, as user 2 is in two location_groups. I understand why the error is being thrown; I'm trying to work out how to solve it.

To clarify: when searching for user_id 1 in location_groups.user_ids, location_id 1 should be returned. When searching for user_id 2, location_ids 1, 2 and 3 should be returned.

I know this is a complicated query so if anything isn't clear just let me know. Any help would be appreciated! Thank you.

outis On BEST ANSWER

You could use GROUP_CONCAT to combine the location_ids in the subquery.

SELECT location_id 
FROM locations 
WHERE FIND_IN_SET(location_id, 
    (SELECT GROUP_CONCAT(location_ids)
     FROM location_groups 
     WHERE FIND_IN_SET(2,location_groups.user_ids)) )
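
One caveat with the GROUP_CONCAT approach: MySQL truncates the concatenated string at group_concat_max_len (1024 by default), so with many groups or long lists some location_ids could be dropped. A possible alternative, sketched here against the question's tables and not tested against the real data, is to join on FIND_IN_SET directly so that no concatenated string is built:

```sql
-- Sketch: join each location to every group whose CSV list contains it,
-- then keep only the groups that contain the given user (2 here).
SELECT DISTINCT loc.location_id
FROM locations AS loc
  JOIN location_groups AS lg
    ON FIND_IN_SET(loc.location_id, lg.location_ids)
WHERE FIND_IN_SET(2, lg.user_ids);
```

DISTINCT is needed because a location listed in more than one matching group would otherwise appear once per group. Note that FIND_IN_SET expects lists without spaces after the commas, as in the sample data.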

Alternatively, use the problems with writing the query as an example of why normalization is good. Heck, even if you do use this query, it will run more slowly than a query on properly normalized tables; you could use that to show why the tables should be restructured.


For reference (and for other readers), here's what a normalized schema would look like (some additional alterations to the base tables are included).

The compound fields in the location_groups table could simply be separated into additional rows to achieve 1NF, but this wouldn't be in 2NF, as the name column would be dependent on only the location part of the (location, user) candidate key. (Another way of thinking of this is the name is an attribute of the regions, not the relations between regions/groups, locations and users.)
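
To make the 2NF problem concrete, here is a sketch of what that 1NF-only table might look like. The table name location_group_members is hypothetical and is not part of the final schema; the rows are derived from the sample data in the question.

```sql
-- Hypothetical 1NF-only table: one row per (location, user) pair.
CREATE TABLE location_group_members (
    `location` INT UNSIGNED NOT NULL,
    `user` INT UNSIGNED NOT NULL,
    `name` VARCHAR(63) NOT NULL,
    PRIMARY KEY (`location`, `user`)
);

-- `name` is repeated for every user of a location: it depends on
-- `location` alone, not on the whole (location, user) key.
INSERT INTO location_group_members (`location`, `user`, `name`)
  VALUES
(1, 1, 'south east'),
(1, 2, 'south east'),
(1, 4, 'south east'),
(2, 2, 'south west'),
(3, 2, 'south west');
```

Renaming a group in this layout means updating several rows, and a missed row leaves the data inconsistent.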

Instead, these columns will be split off into two additional tables for 1NF: one to connect locations and regions, and one to connect users and regions. It may be that the latter should be a relation between users and locations (rather than regions), but that's not the case with the current schema (which could be another problem of the current, non-normalized schema). The region-location relation is one-to-many (since each location is in one region). From the sample data, we see the region-user relation is many-many. The location_groups table then becomes the region table.

-- normalized from `location_groups`
CREATE TABLE regions (
    `id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    `name` VARCHAR(63) UNIQUE NOT NULL
);

-- slightly altered from original
CREATE TABLE locations (
    `id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    `name` VARCHAR(63) UNIQUE NOT NULL
);

-- missing from original sample
CREATE TABLE users (
    `id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    `name` VARCHAR(63) UNIQUE NOT NULL
);

-- normalized from `location_groups`
CREATE TABLE location_regions (
    `region` INT UNSIGNED,
    `location` INT UNSIGNED UNIQUE NOT NULL,
    PRIMARY KEY (`region`, `location`),
    FOREIGN KEY (`region`)
        REFERENCES regions (id)
        ON DELETE restrict ON UPDATE cascade,
    FOREIGN KEY (`location`)
        REFERENCES locations (id)
        ON DELETE cascade ON UPDATE cascade
);

-- normalized from `location_groups`
CREATE TABLE user_regions (
    `region` INT UNSIGNED NOT NULL,
    `user` INT UNSIGNED NOT NULL,
    PRIMARY KEY (`region`, `user`),
    FOREIGN KEY (`region`)
        REFERENCES regions (id)
        ON DELETE restrict ON UPDATE cascade,
    FOREIGN KEY (`user`)
        REFERENCES users (id)
        ON DELETE cascade ON UPDATE cascade
);

Sample data:

INSERT INTO regions (`name`)
  VALUES
('South East'),
('South West'),
('North East'),
('North West');

INSERT INTO locations (`name`)
  VALUES
('London'),
('Bristol'),
('Exeter'),
('Hull');

INSERT INTO users (`name`)
  VALUES
('Alice'),
('Bob'),
('Carol'),
('Dave'),
('Eve');

-- ---- Location-Region relation ----
-- temporary table used to map natural keys to surrogate keys
CREATE TEMPORARY TABLE loc_rgns (
    `location` VARCHAR(63) UNIQUE NOT NULL,
    `region` VARCHAR(63) NOT NULL
);

-- Hull added to demonstrate correctness of desired query
INSERT INTO loc_rgns (region, location)
  VALUES
('South East', 'London'),
('South West', 'Bristol'),
('South West', 'Exeter'),
('North East', 'Hull');

-- map natural keys to surrogate keys for final relationship
INSERT INTO location_regions (`location`, `region`)
  SELECT loc.id, rgn.id
    FROM locations AS loc
      JOIN loc_rgns AS lr ON loc.name = lr.location
      JOIN regions AS rgn ON rgn.name = lr.region;

-- ---- User-Region relation ----
-- temporary table used to map natural keys to surrogate keys
CREATE TEMPORARY TABLE usr_rgns (
    `user` INT UNSIGNED NOT NULL,
    `region` VARCHAR(63) NOT NULL,
    UNIQUE (`user`, `region`)
);

-- user 3 added in order to demonstrate correctness of desired query
INSERT INTO usr_rgns (`user`, `region`)
  VALUES
(1, 'South East'),
(2, 'South East'),
(2, 'South West'),
(3, 'North West'),
(4, 'South East');

-- map natural keys to surrogate keys for final relationship
INSERT INTO user_regions (`user`, `region`)
  SELECT ur.`user`, rgn.id
    FROM usr_rgns AS ur
      JOIN regions AS rgn ON rgn.name = ur.region;

Now, the desired query for the normalized schema, here for user 2:

SELECT DISTINCT loc.id
FROM locations AS loc
  JOIN location_regions AS lr ON loc.id = lr.location
  JOIN user_regions AS ur ON lr.region = ur.region
WHERE ur.`user` = 2
;

Result:

+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+
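
To check the query against the other sample users, the user id can be supplied through a variable. This is only a sketch: @user_id is a session variable introduced here for illustration and does not appear in the original question or schema.

```sql
-- @user_id is a hypothetical session variable, used only for illustration.
SET @user_id = 1;

SELECT DISTINCT loc.id
FROM locations AS loc
  JOIN location_regions AS lr ON loc.id = lr.location
  JOIN user_regions AS ur ON lr.region = ur.region
WHERE ur.`user` = @user_id;
```

With the sample data above, user 1 belongs only to South East, so only London (id 1) comes back. User 3 belongs only to North West, which has no locations, so the result is empty. Hull (id 4) is never returned because no user is assigned to North East.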