Resolve performance problem in recursive CTE

124 views Asked by At

I have a dimensional (star schema) data model that supports BI on labor data. It includes an employee dimension table. This table is a type 2 slowly-changing dimension table, i.e., it stores history as separate rows. It currently has about 9k rows in it.

I have a need to create a new column on this employee dimension table (in a SQL view), which will contain a pipe-delimited list of each employee's "organization hierarchy" (list of supervisors). For example, if employee C reports to manager B who reports to manager A at time t1, then this new column will contain value 'A|B|C'. If employee reports to manager D who reports to manager A at time t2, then this new column will contain value 'A|D|C'.

I am using a recursive CTE for this new column. On some low-volume (9 rows), mocked-up data, this recursive CTE works just fine. However, when I run this recursive CTE on the actual data (again, only about 9k rows), it still hasn't returned a result set after 30 minutes. This is surprising, giving that the data volume is very low.

If it matters, this is being ran on Azure SQL MI. And, there is a clustered index on the table's primary key ([EmployeeHistoryKey]).

In looking at an estimated query execution plan, I noticed that the bulk of the query cost was from a table scan on the [ManagerID] column. So, I put a non-clustered index on the [ManagerID] column, but that didn't improve the performance. So, how can I improve the query to be more performant?

Below is my mocked-up data and query:

IF OBJECT_ID('SomeDB.some_schema.OrgHierarchyMockup', 'U') IS NOT NULL
    DROP TABLE SomeDB.some_schema.OrgHierarchyMockup
;

CREATE TABLE
    SomeDB.some_schema.OrgHierarchyMockup
(
    EmployeeHistoryKey int
    ,EmployeeID char(1)
    ,ManagerID char(1)
    ,SomeAttribute char(1)
    ,RowEffectiveDate date
    ,RowExpirationDate date
)
;

INSERT INTO
    SomeDB.some_schema.OrgHierarchyMockup
VALUES
    (1, 'a', NULL, 'x', '2023-01-01', '2023-06-01')
    ,(2, 'a', NULL, 'y', '2023-06-02', NULL)
    ,(3, 'b', 'a', 'x', '2023-01-01', '2023-06-01')
    ,(4, 'b', 'a', 'y', '2023-06-02', NULL)
    ,(5, 'c', 'a', 'x', '2023-01-01', NULL)
    ,(6, 'd', 'b', 'x', '2023-01-01', NULL)
    ,(7, 'e', 'b', 'x', '2023-01-01', '2023-06-01')
    ,(8, 'e', 'c', 'x', '2023-06-02', NULL)
    ,(9, 'f', 'c', 'x', '2023-01-01', NULL)
;

WITH
traversed_hierarchy AS
(
    SELECT --anchor member
        EmployeeHistoryKey
        ,EmployeeID
        ,ManagerID
        ,RowEffectiveDate
        ,RowExpirationDate
        ,CAST(EmployeeID AS varchar(max)) AS OrgHierarchy --this is the org hierarchy
        ,EmployeeHistoryKey AS m_EmployeeHistoryKey --necessary to "de-dupe" the result set
    FROM
        SomeDB.some_schema.OrgHierarchyMockup
    WHERE
        ManagerID IS NULL

    UNION ALL

    SELECT --recursive member
        s.EmployeeHistoryKey
        ,s.EmployeeID
        ,s.ManagerID
        ,s.RowEffectiveDate
        ,s.RowExpirationDate
        ,m.OrgHierarchy1 + '|' + s.EmployeeID
        ,m.EmployeeHistoryKey
    FROM
        SomeDB.some_schema.OrgHierarchyMockup AS s --direct subordinates
    INNER JOIN --must be an inner join (not a left join) because we want the direct subordinates of the previously-fetched level in traversed_hierarchy
        traversed_hierarchy AS m
        ON
            m.EmployeeID = s.ManagerID
)
,rownumbered AS
(
    SELECT
        EmployeeHistoryKey
        ,EmployeeID
        ,ManagerID
        ,RowEffectiveDate
        ,RowExpirationDate
        ,OrgHierarchy
        ,m_EmployeeHistoryKey
        ,ROW_NUMBER() OVER( --this will allow us to de-dupe the result set
            PARTITION BY
                EmployeeHistoryKey
            ORDER BY
                m_EmployeeHistoryKey
        ) AS RowNum
    FROM
        traversed_hierarchy
)
SELECT
    EmployeeHistoryKey
    ,EmployeeID
    ,OrgHierarchy
    ,m_EmployeeHistoryKey
FROM
    rownumbered
WHERE
    RowNum = 1
ORDER BY
    EmployeeID
    ,EmployeeHistoryKey
    ,m_EmployeeHistoryKey
;
1

There are 1 answers

3
Alberto Morillo On BEST ANSWER

Use temporary tables instead of the CTEs and that alone should give you better performance. SQL Server can use statistics and make better decisions when creating query plans when you use temporary tables. The more your data grow over time nested CTEs involved in joins with big tables the bigger the memory grants they will request, if the server gets under memory contention your CTEs requesting memory grants will hang. Do not use table variables also in that scenario you shared with us.

In addition, with temporary tables you can create indexes to support join operations involving the temporary tables. With CTEs you cannot do that.