Is it really necessary to have GROUP BY in the SQL standard

2.2k views Asked by At

After writing SQL for a few years, I find it often annoying having to put the columns I am interested in SELECT and then specify them again in GROUP BY. I can't help thinking, why do we have to do that?

What is the reason that user has to be specific about which column to group by? Can't we just let the SQL engine assume if there is an aggregate function in SELECT, group by the rest non-aggregate columns?

This will be especially helpful and more concise when you have a large CASE WHEN in SELECT.

3

There are 3 answers

0
Gurwinder Singh On BEST ANSWER

Because they may not always match exactly.

For example, If I want find out the maximum number of books per category, I could do:

select max(cnt)
from (
    select count(*) as cnt
    from books
    group by category
    ) t;

In some DBs such as Oracle, you can even do this:

select max(count(*))
from books
group by category;

I don't really need to specify the category column as I don't need it.

A few databases such as Postgres support the use of aliases in the group by clause.

0
Gordon Linoff On

I happen to sort of agree with you. If someone wanted a more esoteric group by -- say, leaving out columns -- then they could use a subquery.

If I had to guess, the writers of SQL did not want to imbue so much power in aggregation functions. Your suggestion would mean that a function in the select is determining the definition of the row in the result set. Normally, select just determines the columns. That is, it is one thing for a query to fail syntactically because an aggregation is included without a group by. It is another thing for a function in the select to change the rows being output.

You can use window functions with select distinct. Although I don't recommend the syntax, you can do:

select distinct x, count(*) over (partition by x)
from t;

Well, this eliminates the group by, but you still have to repeat the grouping criteria in each window function.

0
Manngo On

You have to imagine that you’re working with two versions of the table here. For example:

SELECT …
FROM table
GROUP BY …;

First, remember that the SELECT is evaluated after the FROM anbd GROUP BY clauses. That means that what you can select is limited by the results of those clauses.

The thing is to imagine that the GROUP BY generates a new virtual table. This virtual table only has the following:

  • The grouped columns
  • Summaries (aggregates) of all columns
  • Nothing else

There will be one row of summaries for each distinct group.

If you want a particular column in your SELECT clause, it must be either a group column or a summary, since you can only SELECT from what’s available.

Even without a GROUP BY clause, there is an implicit GROUP BY () which results in a single row of summaries. Some DBMSs (not all) even allow you to add that, though it doesn’t change anything.

Note that the number of rows in the GROUP BY summary is affected by the number of columns you are grouping. In general, the number of rows will be something like (DISTINCT Group1)*(DISTINCT GROUP2) etc. That means you certainly don’t want to group by more than you really need.

Exceptions

Suppose you have a statement like this:

SELECT state, name, count(*)
FROM customers
GROUP BY state;

This, of course, would fail. What meaning would there be select the name where there are multiple values for one state?

However, MySQL, in traditional mode, will let you do just that: it will select one name to go with the state. However, there is no guarantee which one, so it’s of dubious value.

Suppose you you want to group by month. You may wish to display the month name, but sort by the month number. Here you will need to group by both. Using pseudo date functions:

SELECT monthname, count(*)
FROM data
GROUP BY monthname, monthnumber
ORDER BY monthnumber;

This is nothing more than a workaround. It takes advantage of the fact that for each month name there is exactly one month number, to there is no real further grouping. It’s just there to get both values in the GROUP BY virtual table.