Query with EXCEPT before GROUP BY fails

58 views Asked by At

The following query is attempting to exclude one array from the main array, then group results.

SELECT utc_offset, is_dst,
       ltrim(
               trim(string_agg(distinct (CASE WHEN abbrev NOT LIKE '+%' AND abbrev NOT LIKE '-%' AND abbrev != name THEN abbrev ELSE '' END), ' ')) ||
               ' ' || string_agg(name, ', ' ORDER BY name)
       )
FROM pg_timezone_names
WHERE name NOT LIKE 'posix/%'
  AND name NOT LIKE 'Etc/%'
  AND (lower(abbrev) <> abbrev)
  AND name NOT IN ('HST', 'Factory', 'GMT', 'GMT+0', 'GMT-0', 'GMT0', 'localtime', 'UCT', 'Universal', 'UTC', 'PST8PDT', 'ROK', 'W-SU', 'MST', 'CST6CDT')
EXCEPT
SELECT n.*, a.*
FROM   pg_timezone_names n 
JOIN   pg_timezone_abbrevs a ON  a.abbrev = n.name
WHERE  n.utc_offset <> a.utc_offset
GROUP BY utc_offset, is_dst
ORDER BY utc_offset, is_dst
;

is instead generating an error:

ERROR:  column "pg_timezone_names.utc_offset" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT utc_offset, is_dst,

The query works without the EXCEPT block. Clearly there is a syntactic problem - the GROUP BY (and by extension) ORDER BY verbs are not being seen.

How should this query then be cast?

3

There are 3 answers

0
Erwin Brandstetter On BEST ANSWER

Question

The stated objective is to ...

exclude one array from the main array, then group results.

I think you really want this:

"Exclude time zones where an abbreviation with identical name but different offset exists - before aggregating qualifying rows."

What's wrong with the query?

  • You really want to exclude rows before aggregating the string. Just adding parentheses won't fix it.
  • EXCEPT requires compatible SELECT lists left and right.
  • SELECT n.*, a.* is a repeat offender: it also disagrees with the GROUP BY clause - the immediate cause for the observed error message.

Proof of concept

You attempt would work like this:

SELECT utc_offset, is_dst,
       ltrim(
               trim(string_agg(distinct (CASE WHEN abbrev NOT LIKE '+%' AND abbrev NOT LIKE '-%' AND abbrev != name THEN abbrev ELSE '' END), ' ')) ||
               ' ' || string_agg(name, ', ' ORDER BY name)
       )
FROM  (
   SELECT utc_offset, is_dst, name, abbrev
   FROM   pg_timezone_names
   WHERE  name NOT LIKE 'posix/%'
   AND    name NOT LIKE 'Etc/%'
   AND    (lower(abbrev) <> abbrev)
   AND    name NOT IN ('HST', 'Factory', 'GMT', 'GMT+0', 'GMT-0', 'GMT0', 'localtime', 'UCT', 'Universal', 'UTC', 'PST8PDT', 'ROK', 'W-SU', 'MST', 'CST6CDT')
   EXCEPT
   SELECT n.utc_offset, n.is_dst, n.name, n.abbrev
   FROM   pg_timezone_names n 
   JOIN   pg_timezone_abbrevs a ON  a.abbrev = n.name
   WHERE  n.utc_offset <> a.utc_offset
   ) sub
GROUP BY utc_offset, is_dst
ORDER BY utc_offset, is_dst
;

But don't. Rather use this 100 % equivalent ...

Superior query

SELECT utc_offset, is_dst
     , concat_ws(' ', string_agg(DISTINCT abbrev, ' ' ORDER BY abbrev) FILTER (WHERE NOT abbrev ^@ ANY ('{+,-}') AND abbrev <> name)
                    , string_agg(name, ', ' ORDER BY name)) AS abbrevs_and_names
FROM   pg_timezone_names n
WHERE  lower(abbrev) <> abbrev
AND    NOT name ^@ ANY ('{posix/, Etc/}')
AND    name <> ALL ('{HST, Factory, GMT, GMT+0, GMT-0, GMT0, localtime, UCT, Universal, UTC, PST8PDT, ROK, W-SU, MST, CST6CDT}')
AND    NOT EXISTS (  -- !!!
   SELECT FROM pg_timezone_abbrevs a
   WHERE  a.abbrev = n.name
   AND    a.utc_offset <> n.utc_offset
   )
GROUP  BY utc_offset, is_dst
ORDER  BY utc_offset, is_dst
;

fiddle

NOT EXISTS is much easier than EXCEPT here. (It is most of the time.) See:

I threw in a couple other (optional) simplifications & optimizations:

About the "starts with" operator ^@ (requires at least Postgres 11):

About the ANY construct:

About null-safe concatenation with concat_ws():

About the aggregate FILTER clause:

1
Maimoona Abid On

In this situation, you're attempting to aggregate the results of two subqueries after applying the EXCEPT clause to them. To accomplish this, you can apply the EXCEPT first and then carry out the aggregate using a common table expression (CTE). Here is your updated query:

WITH cte AS (
    SELECT utc_offset, is_dst,
           ltrim(
               trim(string_agg(distinct (CASE WHEN abbrev NOT LIKE '+%' AND abbrev NOT LIKE '-%' AND abbrev != name THEN abbrev ELSE '' END), ' ')) ||
               ' ' || string_agg(name, ', ' ORDER BY name)
           ) AS agg_column
    FROM pg_timezone_names
    WHERE name NOT LIKE 'posix/%'
      AND name NOT LIKE 'Etc/%'
      AND (lower(abbrev) <> abbrev)
      AND name NOT IN ('HST', 'Factory', 'GMT', 'GMT+0', 'GMT-0', 'GMT0', 'localtime', 'UCT', 'Universal', 'UTC', 'PST8PDT', 'ROK', 'W-SU', 'MST', 'CST6CDT')
    GROUP BY utc_offset, is_dst
)
SELECT utc_offset, is_dst, agg_column
FROM cte
EXCEPT
SELECT n.utc_offset, n.is_dst, a.agg_column
FROM pg_timezone_names n 
JOIN pg_timezone_abbrevs a ON a.abbrev = n.name
WHERE n.utc_offset <> a.utc_offset
ORDER BY utc_offset, is_dst;

Hope it works :)

0
Kihara On

The error you're encountering is because of the EXCEPT clause, which combines the results of two queries. When you use EXCEPT, the column names and data types of the two queries must match. In your case, the first query is selecting specific columns (utc_offset, is_dst, and a concatenated string), while the second query is selecting all columns from both pg_timezone_names and pg_timezone_abbrevs tables using n.* and a.*. This mismatch in column selection is causing the error. To resolve this issue, update the query such that the SELECT clauses in both parts of the EXCEPT statement have the the same column names and data types. In both parts of the EXCEPT clause, you can specifically select the required columns.Here's how you can modify your query:

SELECT utc_offset, is_dst,
       ltrim(
               trim(string_agg(distinct (CASE WHEN abbrev NOT LIKE '+%' AND abbrev NOT LIKE '-%' AND abbrev != name THEN abbrev ELSE '' END), ' ')) ||
               ' ' || string_agg(name, ', ' ORDER BY name)
       )
FROM pg_timezone_names
WHERE name NOT LIKE 'posix/%'
  AND name NOT LIKE 'Etc/%'
  AND (lower(abbrev) <> abbrev)
  AND name NOT IN ('HST', 'Factory', 'GMT', 'GMT+0', 'GMT-0', 'GMT0', 'localtime', 'UCT', 'Universal', 'UTC', 'PST8PDT', 'ROK', 'W-SU', 'MST', 'CST6CDT')
EXCEPT
SELECT n.utc_offset, n.is_dst,
       ltrim(
               trim(string_agg(distinct (CASE WHEN a.abbrev NOT LIKE '+%' AND a.abbrev NOT LIKE '-%' AND a.abbrev != n.name THEN a.abbrev ELSE '' END), ' ')) ||
               ' ' || string_agg(n.name, ', ' ORDER BY n.name)
       )
FROM   pg_timezone_names n 
JOIN   pg_timezone_abbrevs a ON  a.abbrev = n.name
WHERE  n.utc_offset <> a.utc_offset
GROUP BY n.utc_offset, n.is_dst
ORDER BY n.utc_offset, n.is_dst;

Hope it helps.