SELECT Unaggregated Variables, Functionally Dependent on GROUPing Variables

133 views Asked by At

Background

I am working in a very restricted T-SQL environment, where one may only define the "body" of a VIEW: presumably the ... in

CREATE VIEW My_View AS ...

under the hood. Here is my @@VERSION:

Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64)
        Jan 27 2023 16:44:09
        Copyright (C) 2019 Microsoft Corporation
        Web Edition (64-bit) on Linux (Amazon Linux 2)

I should note that these tables are synced from flat files, and as such, no formal schematic structure is preserved from the original source. That is, all "functional dependencies" are merely inferred (albeit reliably) from the column names and business concepts.

Problem

Suppose I have the following table called My_Measures...

Person_ID Name Measure
1 Greg 0
1 Greg 10
2 Devon 20
2 Devon 30

...where Name is functionally dependent on Person_ID.

The Usual

Now suppose I wish to aggregate the Measure into various summary statistics for each person. This is simple enough in SQL...

SELECT
    Person_ID,
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

...and yields the following results:

Person_ID Min_Measure Max_Measure Avg_Measure
1 0 10 5
2 20 30 25

The Twist

But suppose I wish to include Name alongside each Person_ID, like so:

Person_ID Name Min_Measure Max_Measure Avg_Measure
1 Greg 0 10 5
2 Devon 20 30 25

Obviously, the following attempt...

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄
    Name,
--  ^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

...will fail with this error:

Column 'My_Measures.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Attempts

I have found several unsatisfactory ways to yield the intended output.

(1) GROUP BY Dependent Variables

One way is to GROUP BY the Name column after Person_ID; and more generally, to append the dependent variable(s) at the end of the GROUP BY clause:

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄
    Name,
--  ^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM
    My_Measures
GROUP BY
--           ⌄⌄⌄⌄⌄⌄
    Person_ID, Name
--           ^^^^^^

This leaves the grouping unaltered, since the "real" grouping variable(s) (here Person_ID) have already defined it, and the dependent variables are merely "tagging along". However, this wastes performance on processing (arbitrarily many) dependent variables, which may be more complex (CHAR strings for Name) for indexing purposes.

(2) "Aggregate" Dependent Variables

Another way is to "aggregate" the Name column, with some function (like MIN()) that gives us one representative value (like 'Greg') from many identical duplicates (like ('Greg', 'Greg')).

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    MIN(Name)    AS Name,
--  ^^^^^^^^^^^^^^^^^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

This likewise achieves the intended result, but it likewise wastes performance on computing an aggregation of many identical values. Furthermore, it will only work on values that are comparable and thus have a MIN(); but it will obviously fail for non-comparable datatypes.

(3) ReJOIN After Aggregating

Perhaps the most disappointing way is to simply calculate the aggregates, and then reassociate the Person_ID with its Name via a JOIN:

-- Aggregate by ID.
WITH agg AS(
    SELECT
        Person_ID,
        MIN(Measure) AS Min_Measure,
        MAX(Measure) AS Max_Measure,
        AVG(Measure) AS Avg_Measure
    FROM 
        My_Measures
    GROUP BY 
        Person_ID
    
-- Deduplicate names for the JOIN. Given functional dependency, DISTINCT suffices.
), msr AS (
    SELECT DISTINCT
        Person_ID,
        Name
    FROM My_Measures
    
-- Reassociate the names with their IDs.
) SELECT
    agg.Person_ID   AS Person_ID,
--  ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    msr.Name        AS Name,
--  ^^^^^^^^^^^^^^^^^^^^^^^^
    agg.Min_Measure AS Min_Measure,
    agg.Max_Measure AS Max_Measure,
    agg.Avg_Measure AS Avg_Measure
FROM
--      ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    agg INNER JOIN msr
--      ^^^^^^^^^^^^^^
    ON agg.Person_ID = msr.Person_ID

Obviously, this squanders significant resources on an unnecessary JOIN and multiple CTEs, all to to recover data (like Name) that we originally had!

(4) Take FIRST_VALUE() Over a PARTITION

I have searched for some equivalent in T-SQL of the first() function in R. In SQL, such a FIRST() would simply select the very first value ('Greg') from many identical duplicates (('Greg', 'Greg')) within a GROUP, without requiring any costly computations. Furthermore, this would work regardless of comparability.

I have stumbled across the FIRST_VALUE() function, but that seems to require a PARTITION for each usage, and — in my relative inexperience with optimizing PARTITIONs — I worry about the impact on performance, if many dependent variables must be SELECTed.

It also looks ugly. ¯\(ツ)

Question

What is the best way to SELECT an arbitrary set of dependent variables (like Name) alongside the grouping variables (like Person_ID)? Please prioritize performance, but also consider elegance and canonicity and finally extensibility: this should ideally work for all datatypes, even the non-comparable.

0

There are 0 answers