How do I convert a resource intensive self join to a faster query?

87 views Asked by At

This SQL query takes 90 minutes to run in DuckDB with my million row table. I am using a self join pattern as that's the only way I have found this query to work but I am sure there must be a faster query pattern.

  • The query starts with a Common Table Expression (CTE) that selects all columns from the main.openorder_tbl table and assigns a row number to each row based on the "Invoice Date" column.

  • The main query then performs a self join on the CTE where it joins each row (self) with all previous rows (other) that belong to the same "group" and meet certain conditions related to the "Invoice Date" and "Due Date".

  • The conditions for the join are:

  1. The "Invoice Date" of the self row is more than 90 days after the "Due Date" of the other row.
  2. The "Date Closed" of the other row is more than 90 days after its "Due Date".
  3. The "Due Date" of the other row is within 200 days before the "Invoice Date" of the self row.
  • For each self row, the query counts the number of other rows that meet these conditions.

  • The result is a table that includes all columns from the self row and the count of other rows for each self row.

  • The output is ordered by the row number of the self row.

This query essentially calculates a rolling count of rows in the same group that meet certain date-related conditions. The self join is necessary because I am comparing each row with all previous rows in the same group.

Example Input Table

row_number Invoice Date Due Date Date Closed days amount group
1 2022-10-22 2022-11-21 2024-03-01 466 111 B
2 2022-10-22 2022-11-21 2023-03-01 100 111 B
3 2022-10-24 2022-11-23 2023-03-03 100 150 B
4 2022-10-31 2022-11-30 2023-03-05 95 300 A
5 2022-11-10 2022-12-10 2023-02-02 54 180 A
6 2022-11-21 2022-12-21 2023-02-04 45 220 B
7 2022-12-04 2023-01-03 2024-01-04 366 210 B
8 2022-12-19 2023-01-18 2023-01-20 2 100 A
9 2023-01-01 2023-01-31 2023-02-20 20 200 B
10 2023-01-22 2023-02-21 2023-06-01 100 280 B
11 2023-02-28 2023-03-30 2023-04-02 3 250 A
12 2023-05-14 2023-06-13 2023-09-01 80 21 A
13 2023-06-18 2023-07-18 2023-10-01 75 456 A
14 2023-07-02 2023-08-01 2023-08-01 0 320 B

SQL

WITH cte AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS row_number,
        "Invoice Date",
        "Due Date",
        "Date Closed",
        "days",
        "amount",
        "group"
    FROM main.openorder_tbl 
)
select
   first(
      columns(self.*)
   ),
   count(other."row_number")
from 
   cte self left join cte other 
on
   self."group" = other."group"
   and
   self."row_number" > other."row_number" 
    AND
     (self."Invoice Date" - other."Due Date") > 90
    AND 
    (other."Date Closed" - other."Due Date") > 90
    AND 
    other."Due Date" > (self."Invoice Date" -200)
    group by
   self."row_number"
order by 
   self."row_number"
   
   

Output

first(self.row_number) first(self."Invoice Date") first(self."Due Date") first(self."Date Closed") first(self."days") first(self.amount) first(self."group") count(other.row_number)
1 2022-10-22 2022-11-21 2024-03-01 466 111 B 0
2 2022-10-22 2022-11-21 2023-03-01 100 111 B 0
3 2022-10-24 2022-11-23 2023-03-03 100 150 B 0
4 2022-10-31 2022-11-30 2023-03-05 95 300 A 0
5 2022-11-10 2022-12-10 2023-02-02 54 180 A 0
6 2022-11-21 2022-12-21 2023-02-04 45 220 B 0
7 2022-12-04 2023-01-03 2024-01-04 366 210 B 0
8 2022-12-19 2023-01-18 2023-01-20 2 100 A 0
9 2023-01-01 2023-01-31 2023-02-20 20 200 B 0
10 2023-01-22 2023-02-21 2023-06-01 100 280 B 0
11 2023-02-28 2023-03-30 2023-04-02 3 250 A 0
12 2023-05-14 2023-06-13 2023-09-01 80 21 A 1
13 2023-06-18 2023-07-18 2023-10-01 75 456 A 0
14 2023-07-02 2023-08-01 2023-08-01 0 320 B 2

I have tried using a window function, but the problem is if I do the window function SQL won't let me use conditionals inside of it, this is of course how it's supposed to work. Maybe there's a faster way by doing a sub query with the window function or some such thing. I am not an SQL expert by any stretch. Unfortunately I am at a loss, I know I need to filter down the table as much as possible before doing the rolling count so it runs fast but not sure if there's a way to do that in this case.

Here is the Explain Analyze output from a larger table so the timings are more realistic, the column names are somewhat different but the logic is the same:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE
 ---Start CTE with index 
 WITH cte
 AS (SELECT ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS "row_number",
            "Doc|PayItem",
            "Invoice Date",
            "Date Closed",
            "Due Date",
            "DaysPastDue",
            "Gross Amount",
            "ParentName"
     FROM main.df_CustLedger
    )
 ---select the first value of 
 select first(columns(self."Doc|PayItem")) as "Doc|PayItem",
        COALESCE(count(other."row_number"), 0) as "NumberOfPaidInvoices_std"
 --       COALESCE(sum(other."Gross Amount"), 0) as "TotalGrossAmtPaid_std",
 --       COALESCE(AVG(other."DaysPastDue"), 0) as "MeanDelay_std"
 from cte self
     left join cte other
         on self."ParentName" = other."ParentName"
            and self."row_number" > other."row_number"
            and (self."Invoice Date" - other."Due Date") > 90
            and (other."Date Closed" - other."Due Date") > 90
            and (other."Due Date" > (self."Invoice Date" -200))
 group by self."row_number"
 order by self."row_number"
 
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 272.71s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐                             
│      RESULT_COLLECTOR     │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│             0             │                             
│          (0.00s)          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│      EXPLAIN_ANALYZE      │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│             0             │                             
│          (0.00s)          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│          ORDER_BY         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│          ORDERS:          │                             
│           #3 ASC          │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│          1628109          │                             
│          (0.39s)          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         PROJECTION        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│        Doc|PayItem        │                             
│        Doc|PayItem        │                             
│  NumberOfPaidInvoices_std │                             
│         row_number        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│          1628109          │                             
│          (0.03s)          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│       HASH_GROUP_BY       │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│             #0            │                             
│         first(#1)         │                             
│         first(#2)         │                             
│         count(#3)         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│          1628109          │                             
│          (10.87s)         │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│         PROJECTION        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│         row_number        │                             
│        Doc|PayItem        │                             
│        Doc|PayItem        │                             
│         row_number        │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│         119091014         │                             
│          (0.88s)          │                             
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
│     BLOCKWISE_NL_JOIN     │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│            LEFT           │                             
│  (((Invoice Date - 200) < │                             
│ Due Date) AND ((row_n...  │                             
│ row_number) AND (((Invoice│                             
│ Date - Due Date) > 90...  ├──────────────┐              
│(ParentName = ParentName)))│              │              
│             )             │              │              
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              
│         119091014         │              │              
│         (962.51s)         │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         row_number        ││         row_number        │
│        Doc|PayItem        ││        Doc|PayItem        │
│        Invoice Date       ││          Due Date         │
│         ParentName        ││         ParentName        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          1628109          ││           116421          │
│          (0.00s)          ││          (0.00s)          │
└─────────────┬─────────────┘└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            ││             #1            │
│             #1            ││             #2            │
│             #2            ││             #3            │
│             #3            ││             #4            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││             #5            │
│          1628109          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          (0.00s)          ││           116421          │
│                           ││          (0.00s)          │
└─────────────┬─────────────┘└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│           WINDOW          ││           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ROW_NUMBER() OVER (ORDER BY││((Date Closed - Due Date) >│
│   Invoice Date ASC NULLS  ││             90)           │
│            LAST)          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││        EC: 1628109        │
│          1628109          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          (0.74s)          ││           116421          │
│                           ││          (0.01s)          │
└─────────────┬─────────────┘└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       df_CustLedger       ││             #0            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││             #1            │
│        Invoice Date       ││             #2            │
│        Doc|PayItem        ││             #3            │
│         ParentName        ││             #4            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││             #5            │
│        EC: 1628109        ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││          1628109          │
│          1628109          ││          (0.00s)          │
│          (0.09s)          ││                           │
└───────────────────────────┘└─────────────┬─────────────┘                             
                             ┌─────────────┴─────────────┐
                             │           WINDOW          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │ROW_NUMBER() OVER (ORDER BY│
                             │   Invoice Date ASC NULLS  │
                             │            LAST)          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │          1628109          │
                             │          (1.82s)          │
                             └─────────────┬─────────────┘                             
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │       df_CustLedger       │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │        Invoice Date       │
                             │        Doc|PayItem        │
                             │        Date Closed        │
                             │          Due Date         │
                             │         ParentName        │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │        EC: 1628109        │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │          1628109          │
                             │          (0.14s)          │
                             └───────────────────────────┘                             
                          

I welcome any ideas to optimize the query.

1

There are 1 answers

1
jqurious On BEST ANSWER

Writing it as a Correlated Subquery maxes out all my CPU cores at 100%.

It executes ~10x faster than the self-join approach. (for a 100_000 row sample)

duckdb.sql("""
WITH cte AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY "Invoice Date") AS row_number,
        "Invoice Date",
        "Due Date",
        "Date Closed",
        "days",
        "amount",
        "group"
    FROM main.openorder_tbl
)
from cte other
select
   *,
   (
       from cte select count(*)
       where
          cte."Due Date" > (other."Invoice Date" - '200 days'::interval)
          and
          (cte."Date Closed" - cte."Due Date") > '90 days'::interval
          and
          (other."Invoice Date" - cte."Due Date") > '90 days'::interval
          and
          cte."group" = other."group"
          and
          other."row_number" > cte."row_number" 
   ) as count
order by row_number
""")
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬────────┬─────────┬───────┐
│ row_number │    Invoice Date     │      Due Date       │     Date Closed     │ days  │ amount │  group  │ count │
│   int64    │      timestamp      │      timestamp      │      timestamp      │ int64 │ int64  │ varchar │ int64 │
├────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────┼────────┼─────────┼───────┤
│          1 │ 2022-10-22 00:00:00 │ 2022-11-21 00:00:00 │ 2024-03-01 00:00:00 │   466 │    111 │ B       │     0 │
│          2 │ 2022-10-22 00:00:00 │ 2022-11-21 00:00:00 │ 2023-03-01 00:00:00 │   100 │    111 │ B       │     0 │
│          3 │ 2022-10-24 00:00:00 │ 2022-11-23 00:00:00 │ 2023-03-03 00:00:00 │   100 │    150 │ B       │     0 │
│          4 │ 2022-10-31 00:00:00 │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │    300 │ A       │     0 │
│          5 │ 2022-11-10 00:00:00 │ 2022-12-10 00:00:00 │ 2023-02-02 00:00:00 │    54 │    180 │ A       │     0 │
│          6 │ 2022-11-21 00:00:00 │ 2022-12-21 00:00:00 │ 2023-02-04 00:00:00 │    45 │    220 │ B       │     0 │
│          7 │ 2022-12-04 00:00:00 │ 2023-01-03 00:00:00 │ 2024-01-04 00:00:00 │   366 │    210 │ B       │     0 │
│          8 │ 2022-12-19 00:00:00 │ 2023-01-18 00:00:00 │ 2023-01-20 00:00:00 │     2 │    100 │ A       │     0 │
│          9 │ 2023-01-01 00:00:00 │ 2023-01-31 00:00:00 │ 2023-02-20 00:00:00 │    20 │    200 │ B       │     0 │
│         10 │ 2023-01-22 00:00:00 │ 2023-02-21 00:00:00 │ 2023-06-01 00:00:00 │   100 │    280 │ B       │     0 │
│         11 │ 2023-02-28 00:00:00 │ 2023-03-30 00:00:00 │ 2023-04-02 00:00:00 │     3 │    250 │ A       │     0 │
│         12 │ 2023-05-14 00:00:00 │ 2023-06-13 00:00:00 │ 2023-09-01 00:00:00 │    80 │     21 │ A       │     1 │
│         13 │ 2023-06-18 00:00:00 │ 2023-07-18 00:00:00 │ 2023-10-01 00:00:00 │    75 │    456 │ A       │     0 │
│         14 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │    320 │ B       │     2 │
├────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴────────┴─────────┴───────┤
│ 14 rows                                                                                               8 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Further Info

In DuckDB, subqueries are always decorrelated. DuckDB uses a state-of-the-art subquery decorrelation algorithm as described in the Unnesting Arbitrary Queries paper. This allows all subqueries to be decorrelated and executed as a single, much more efficient, query.