Split a large query (2 days) into pieces to increase the speed in Postgres


I'm running a query on a database that contains over 120 million rows and 200 columns. The dataset contains loan information: for each borrower we have about 15 rows (one observation per quarter), and the borrower ID is also the primary key. I'm joining on this borrower ID to retrieve information from the previous quarter, which I understand makes my query quite complex. The database also contains a foreign key (FK) on bank. The party that hosts the DB explained to me that if I wrote a function to run the query in parts, by bank, it would be much faster. Currently my query looks like this:

CREATE OR REPLACE VIEW myview AS
SELECT a.id,
   a.arrears AS "Months in Arrears",
   a.balance AS "Original Balance",
   a.bank AS "Bank",
   b.arrears AS "Next Months in Arrears",
   b.balance AS "Next Original Balance",
   dense_rank() OVER (PARTITION BY concat(a.bank, a.id), ("substring"(a.date::text, 1, 4) || 'Q'::text) || round(0.3333 + "substring"(a.date::text, 6, 2)::numeric / 3::numeric, 0) ORDER BY a.date DESC) AS "Duplicate"
FROM loandata a
LEFT JOIN loandata b ON a.bank::text = b.bank::text AND a.id::text = b.id::text AND round((b.date::date - a.date::date)::numeric / 30::numeric, 0) = 3::numeric;

ALTER TABLE myview OWNER TO me;
GRANT ALL ON TABLE myview TO me;

Could you help me write a function so that this query is executed per bank? Or maybe it makes sense to do it per id, as that is the primary key? I'm running this view to download and store the data in Tableau.

Many thanks in advance. Best, Tim

PS. The dense_rank function is there to find loans that have multiple observations per quarter.

The EXPLAIN ANALYZE results suggested in the comments can be found here: http://explain.depesz.com/s/K19. I used a subset of the data where I selected a single bank.
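For reference, the kind of per-bank function the host may have had in mind could be sketched like this. This is a hypothetical sketch, not the host's actual suggestion: it assumes a pre-created results table `myresults` with columns matching the query, and the function name `fill_results_per_bank` is made up.

```sql
-- Hypothetical sketch: materialise the query's output one bank at a time,
-- so each join only touches a single bank's rows.
-- Assumes a pre-created table "myresults" with matching columns.
CREATE OR REPLACE FUNCTION fill_results_per_bank() RETURNS void AS $$
DECLARE
    v_bank loandata.bank%TYPE;
BEGIN
    FOR v_bank IN SELECT DISTINCT bank FROM loandata LOOP
        INSERT INTO myresults
        SELECT a.id,
               a.arrears,
               a.balance,
               a.bank,
               b.arrears,
               b.balance,
               dense_rank() OVER (PARTITION BY a.id, date_trunc('quarter', a.date)
                                  ORDER BY a.date DESC)
        FROM loandata a
        LEFT JOIN loandata b
          ON  a.bank = b.bank
          AND a.id   = b.id
          AND round((b.date::date - a.date::date)::numeric / 30, 0) = 3
        WHERE a.bank = v_bank;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Whether this actually helps depends on an index on (bank, id); without one, each iteration still scans the whole table.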

There are 2 answers

Answer from joop (best answer)

I am not quite sure about the exact intention of the query, but using the date functions instead of computing in numeric types, and avoiding all the casts and concat()s, will probably get you somewhere.

SELECT a.id
   , a.arrears AS "Months in Arrears"
   , a.balance AS "Original Balance"
   , a.bank AS "Bank"
   , b.arrears AS "Next Months in Arrears"
   , b.balance AS "Next Original Balance"

   , dense_rank() OVER (
      PARTITION BY a.bank, a.id, date_trunc( 'quarter' ,a.date )
      ORDER BY a.date DESC
      ) AS "Duplicate"
FROM loandata a
LEFT JOIN loandata b ON a.bank = b.bank
    AND a.id = b.id
    AND date_trunc( 'quarter', b.date ) 
      = date_trunc( 'quarter', a.date + interval '3 months' )
        ;

BTW: don't use columns called date; it is a type name in SQL (it is highlighted in green in my editor).
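To illustrate the quarter bucketing this rewrite relies on: date_trunc('quarter', ...) maps every date in a quarter to that quarter's first day, replacing the string-slicing and rounding arithmetic of the original query. The dates here are just example values:

```sql
-- Every date in Q2 truncates to the quarter's first day.
SELECT date_trunc('quarter', DATE '2014-05-17');
-- 2014-04-01 00:00:00

-- Adding three months before truncating yields the next quarter,
-- which is the join condition used above.
SELECT date_trunc('quarter', DATE '2014-05-17' + interval '3 months');
-- 2014-07-01 00:00:00
```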

Answer from David Aldridge

I'd suggest modifying this:

LEFT JOIN loandata b ON a.bank::text = b.bank::text AND a.id::text = b.id::text AND round((b.date::date - a.date::date)::numeric / 30::numeric, 0) = 3::numeric

... to ...

LEFT JOIN loandata b
  ON a.bank = b.bank AND
     a.id   = b.id   AND
     b.date between (a.date + 75) and (a.date + 105)

Although, I'm not sure that your date logic is what you want it to be. If you can describe in words what you want, then a better method might be available.

https://wiki.postgresql.org/wiki/Working_with_Dates_and_Times_in_PostgreSQL
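As the linked wiki page covers, adding an integer to a date in Postgres adds whole days, so the 75-to-105-day window above brackets a point roughly one quarter later. A quick illustration with an arbitrary example date:

```sql
-- date + integer adds whole days:
-- a.date + 75 .. a.date + 105 spans roughly three months after a.date.
SELECT DATE '2014-01-01' + 75,   -- 2014-03-17
       DATE '2014-01-01' + 105;  -- 2014-04-16
```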