How to find rows in SQL that end with the same string?

722 views Asked by At

I have a question similar to the one found here: How to find rows in SQL that start with the same string (similar rows)?, and this solution works in MySQL 5.6 but not 5.7.

I have a database (t) with multiple columns, the important ones being id and filepath, and what I am trying to accomplish is retrieving all the file paths which have the same last 5 characters. The following works in MySQL5.6, and the second SELECT works fine in 5.7:

SELECT id, filepath FROM t
WHERE SUBSTRING(filepath, -5) IN
(
  SELECT SUBSTRING(filepath, -5) 
  FROM t 
  GROUP BY SUBSTRING(filepath, -5)
  HAVING COUNT(*) > 1
)

But when I try to run it on 5.7 I get the error

Expression #1 of HAVING clause is not in GROUP BY clause and contains 
nonaggregated column 't.filepath' which is not functionally dependent on 
columns in GROUP BY clause; this is incompatible with 
sql_mode=only_full_group_by

Sample data:

id     filepath
1      /Desktop/file1.txt
2      /Desktop/file2.txt
3      /Desktop/file1.txt

and I would want to return the rows with id 1 and 3. How can I fix this for MySQL5.7?

EDIT: Also can anybody point me in the right direction for the SQL to remove the duplicates? So I would want to remove the entry for id 3 but keep the entry for id 1 and 2.

3

There are 3 answers

3
vrijdenker On BEST ANSWER

Please read the mysql documentation on the subject GROUP BY and sql_mode only_full_group_by (like your error message says): https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html

I think changing the inner query to this might fix the problem:

SELECT SUBSTRING(filepath, -5) AS fpath
FROM t 
GROUP BY fpath
HAVING COUNT(fpath) > 1

Edit:

As to your question of why adding the "AS fpath" works: Adding the alias "fpath" is just a clean way to do this. The point of ONLY_FULL_GROUP_BY is that each field you use in the SELECT, HAVING, or ORDER BY must also be in the GROUP BY.

So I added the fpath-alias for multiple reasons:

  • For performance: The query you wrote had SUBSTRING(filepath, -5) twice, which is bad for performance. Mysql has to execute that SUBSTRING call twice, while in my case it has to do it only once (per row).
  • To fix the group-by issue: You had COUNT() in the having, but "" was not in your GROUP BY statement (I'm not even sure whether that would be possible). You had to count "something", so since "fpath" was in your SELECT and in your GROUP BY, using that as your COUNT() would fix the problem.
1
Bill Karwin On

I prefer not to put subqueries in an IN() predicate because MySQL tends to run the subquery many times.

You can write the query differently to put the subquery in the FROM clause as a derived table. That will make MySQL run the subquery just once.

SELECT id, filepath 
FROM (
  SELECT SUBSTRING(filepath, -5) AS suffix, COUNT(*) AS count
  FROM t 
  GROUP BY suffix
  HAVING count > 1
) AS t1
JOIN t AS t2 ON SUBSTRING(t2.filepath, -5) = t1.suffix

This is bound to do a table-scan though, so it's going to be a costly query. It can't use an index when doing a substring comparison like that.

To optimize this, you might create a virtual column with an index.

ALTER TABLE t 
  ADD COLUMN filepath_last VARCHAR(10) AS (SUBSTRING_INDEX(filepath, '/', -1)),
  ADD KEY (filepath_last);

Then you can query it like this, and at least the subquery uses an index:

SELECT id, filepath 
FROM (
  SELECT filepath_last, COUNT(*) AS count
  FROM t 
  GROUP BY filepath_last
  HAVING count > 1
) AS t1
STRAIGHT_JOIN t AS t2 ON t2.filepath_last = t1.filepath_last
0
amper On

The solution that ended up working for me was found here: Disable ONLY_FULL_GROUP_BY

I ran SELECT @@sql_mode then SET @@sql_mode = followed by a string containing all the values returned by the first query except for only_full_group_by, but I'm still interested in how this is to be accomplished without changing the SQL settings.