I have a large hierarchy (2,500+ records) stored in Microsoft SQL Server (2019) using an adjacency list model (e.g., Id
, ParentId
). I am looking for an efficient approach to look up a record based on a particular path in the hierarchy. In other words, given a path (e.g. /Root/FolderA/SubfolderA
), I'd like to retrieve the Id
associated with the final node (i.e., SubfolderA
in this case).
Note: Node names are not globally unique. I.e., we can't just look for
SubfolderA
and assume it maps to/Root/FolderA/SubfolderA
. There may be multiple nodes namedSubfolderA
within the hierarchy.
Setup
Hierarchy
/Root
/FolderA
/SubfolderA
/SubfolderB
/FolderB
/SubfolderA
/SubfolderB
Structure
CREATE
TABLE [dbo].[Tree] (
[Id] INT NOT NULL PRIMARY KEY,
[ParentId] INT NULL,
[Name] VARCHAR(255) NOT NULL,
CONSTRAINT [FK_Hierarchy]
FOREIGN KEY (ParentId)
REFERENCES [Tree]([Id])
)
Data
INSERT INTO Tree VALUES (1, NULL, 'Root');
INSERT INTO Tree VALUES (2, 1, 'FolderA');
INSERT INTO Tree VALUES (3, 2, 'SubfolderA');
INSERT INTO Tree VALUES (4, 2, 'SubfolderB');
INSERT INTO Tree VALUES (5, 1, 'FolderB');
INSERT INTO Tree VALUES (6, 5, 'SubfolderA');
INSERT INTO Tree VALUES (7, 5, 'SubfolderB');
Naïve Approach
There are quite a few threads on how to convert an adjacency list to materialized paths, including:
- SQL - Convert non-null adjacency list to path
- Build Enumeration Path from Adjacency List in SQL
- Flatten Adjacency List Hierarchy To A List Of All Paths
- Load hierarchical data from MSSQL with recursive common table expressions
View
We can use one of these approaches to convert the entire adjacency list into materialized paths using an rCTE:
CREATE
VIEW [dbo].[MaterializedPaths]
WITH SCHEMABINDING
AS
WITH RCTE AS (
SELECT Id,
ParentId,
CAST('/' + Name AS VARCHAR(255)) AS Path
FROM [dbo].[Tree] root
WHERE root.Id = 1
UNION ALL
SELECT this.Id,
this.ParentId,
CAST(parent.Path + '/' + this.Name AS VARCHAR(255)) AS Path
FROM [dbo].[Tree] AS this
INNER JOIN RCTE parent
ON this.ParentId = parent.Id
)
SELECT Id,
Path
FROM RCTE as hierarchy
Output
This produces the following output:
Id Path
1 /Root
2 /Root/FolderA
3 /Root/FolderA/SubfolderA
4 /Root/FolderA/SubfolderB
5 /Root/FolderB
6 /Root/FolderB/SubfolderA
7 /Root/FolderB/SubfolderB
Query
We can filter that output using a simple WHERE
clause:
SELECT Id
FROM MaterializedPaths
WHERE Path = '/Root/FolderA/SubfolderA'
Problem
The naïve approach works fine. The problem is that it's incredibly inefficient—and, thus, slow—for querying large hierarchies, as it needs to dynamically reconstruct the entire set of materialized paths each call. In my case, that takes 8-9 seconds. Obviously, I could just store this data in a table and regenerate it via a trigger at any time the data changes. But I'd rather find a more efficient query and avoid the additional complexity.
Question
What is an efficient way of constructing this query? Or, at the risk of making this an XY problem, is there a way to limit the rCTE so that it only needs to evaluate the nodes in the hierarchy, instead of reconstructing the entire hierarchy each time?
Limiting the rCTE
There are a handful of approaches to limiting the scope of each recursive query so that it's only evaluating the relevant nodes in the hierarchy. A fairly efficient approach is to simply restrict the rCTE to records which the source path (let's call that
@Path
) starts with:This will limit the query to each record in your path:
Which can then be easily filtered down to the final record based on a simple
WHERE
clause:Packaging it as a function
We can combine this with the original rCTE into a function. Putting it all together, it might look like:
Querying by path
Given the above function, you can now query the adjacency list by simply passing the full path to the
GetIdFromPath()
function:Which, given the sample data from the original post, will return
3
.Performance
I've tested this approach against a table of comparable size, with 2,500 sample records, and it consistently executes well under a second, which is a dramatic improvement over the naïve approach. Obviously, you'll need to evaluate this against your own database and performance requirements to determine if its efficient enough.