I have a dimensional (star schema) data model that supports BI on labor data. It includes an employee dimension table. This table is a type 2 slowly-changing dimension table, i.e., it stores history as separate rows. It currently has about 9k rows in it.
I have a need to create a new column on this employee dimension table (in a SQL view), which will contain a pipe-delimited list of each employee's "organization hierarchy" (list of supervisors). For example, if employee C reports to manager B who reports to manager A at time t1, then this new column will contain value 'A|B|C'. If employee reports to manager D who reports to manager A at time t2, then this new column will contain value 'A|D|C'.
I am using a recursive CTE for this new column. On some low-volume (9 rows), mocked-up data, this recursive CTE works just fine. However, when I run this recursive CTE on the actual data (again, only about 9k rows), it still hasn't returned a result set after 30 minutes. This is surprising, giving that the data volume is very low.
If it matters, this is being ran on Azure SQL MI. And, there is a clustered index on the table's primary key ([EmployeeHistoryKey]).
In looking at an estimated query execution plan, I noticed that the bulk of the query cost was from a table scan on the [ManagerID] column. So, I put a non-clustered index on the [ManagerID] column, but that didn't improve the performance. So, how can I improve the query to be more performant?
Below is my mocked-up data and query:
IF OBJECT_ID('SomeDB.some_schema.OrgHierarchyMockup', 'U') IS NOT NULL
DROP TABLE SomeDB.some_schema.OrgHierarchyMockup
;
CREATE TABLE
SomeDB.some_schema.OrgHierarchyMockup
(
EmployeeHistoryKey int
,EmployeeID char(1)
,ManagerID char(1)
,SomeAttribute char(1)
,RowEffectiveDate date
,RowExpirationDate date
)
;
INSERT INTO
SomeDB.some_schema.OrgHierarchyMockup
VALUES
(1, 'a', NULL, 'x', '2023-01-01', '2023-06-01')
,(2, 'a', NULL, 'y', '2023-06-02', NULL)
,(3, 'b', 'a', 'x', '2023-01-01', '2023-06-01')
,(4, 'b', 'a', 'y', '2023-06-02', NULL)
,(5, 'c', 'a', 'x', '2023-01-01', NULL)
,(6, 'd', 'b', 'x', '2023-01-01', NULL)
,(7, 'e', 'b', 'x', '2023-01-01', '2023-06-01')
,(8, 'e', 'c', 'x', '2023-06-02', NULL)
,(9, 'f', 'c', 'x', '2023-01-01', NULL)
;
WITH
traversed_hierarchy AS
(
SELECT --anchor member
EmployeeHistoryKey
,EmployeeID
,ManagerID
,RowEffectiveDate
,RowExpirationDate
,CAST(EmployeeID AS varchar(max)) AS OrgHierarchy --this is the org hierarchy
,EmployeeHistoryKey AS m_EmployeeHistoryKey --necessary to "de-dupe" the result set
FROM
SomeDB.some_schema.OrgHierarchyMockup
WHERE
ManagerID IS NULL
UNION ALL
SELECT --recursive member
s.EmployeeHistoryKey
,s.EmployeeID
,s.ManagerID
,s.RowEffectiveDate
,s.RowExpirationDate
,m.OrgHierarchy1 + '|' + s.EmployeeID
,m.EmployeeHistoryKey
FROM
SomeDB.some_schema.OrgHierarchyMockup AS s --direct subordinates
INNER JOIN --must be an inner join (not a left join) because we want the direct subordinates of the previously-fetched level in traversed_hierarchy
traversed_hierarchy AS m
ON
m.EmployeeID = s.ManagerID
)
,rownumbered AS
(
SELECT
EmployeeHistoryKey
,EmployeeID
,ManagerID
,RowEffectiveDate
,RowExpirationDate
,OrgHierarchy
,m_EmployeeHistoryKey
,ROW_NUMBER() OVER( --this will allow us to de-dupe the result set
PARTITION BY
EmployeeHistoryKey
ORDER BY
m_EmployeeHistoryKey
) AS RowNum
FROM
traversed_hierarchy
)
SELECT
EmployeeHistoryKey
,EmployeeID
,OrgHierarchy
,m_EmployeeHistoryKey
FROM
rownumbered
WHERE
RowNum = 1
ORDER BY
EmployeeID
,EmployeeHistoryKey
,m_EmployeeHistoryKey
;
Use temporary tables instead of the CTEs and that alone should give you better performance. SQL Server can use statistics and make better decisions when creating query plans when you use temporary tables. The more your data grow over time nested CTEs involved in joins with big tables the bigger the memory grants they will request, if the server gets under memory contention your CTEs requesting memory grants will hang. Do not use table variables also in that scenario you shared with us.
In addition, with temporary tables you can create indexes to support join operations involving the temporary tables. With CTEs you cannot do that.