Optimizing sql queries for deleting duplicates in Monetdb

449 views Asked by At

I have a problem where I have a market data table with >100,000,000 rows and I need to search and remove duplicates where the symbol and totvol columns match but the serial_no is different.

I have tried the query below on both a single table and also using a copy of the table for reference, but unfortunately it takes up an enormous amount of heap space (>100G and counting, sometimes filling the harddrive to the brim and crashing my database)and time (>30 mins) and brings my server to a crawl (60-95% cpu usage on 32 cores!) which is unacceptable. Is there an efficient way to write this query to optimize the sql execution if i want to execute something like this regularly?

Normally I would partition the table somehow since duplicates for the most part are inserted adjacent or near each other, but since monetdb is a column store database partitioning this way also takes up a lot of heap space. The only helpful thing I have found to reduce the heap is by creating an entirely new table with a subset of the data (i.e. split alphabetically by symbol) which results in smaller column bat files and then running the operation on the small table, is there any way I can keep the large table in tact and write a query which operates on maybe 1,000,000 rows at a time?

The query:

delete from print_11_11 
where exists (Select a.serial_no 
              from print_11_11 as a, print_11_11 as b 
              where a.symbol=b.symbol 
              and a.totvol = b.totvol
              and a.serial_no>b.serial_no)

Some example data, rows 2 and 3 are duplicates of one another and rows 4-7 are all duplicates = by my critera, note the exseq may be the same or different, it does not matter which exseq value we keep when removing duplicates:

<table border="1"><tr BGCOLOR="#CCCCFF"><th>serial_no</th><th>ttime</th><th>symbol</th><th>vol</th><th>totvol</th><th>exseq</th></tr>
<tr><td>0</td><td>80017</td><td>T</td><td>200</td><td>200</td><td>133813</td></tr>
<tr><td>855</td><td>80017</td><td>T</td><td>42</td><td>242</td><td>133813</td></tr>
<tr><td>867</td><td>80017</td><td>T</td><td>42</td><td>242</td><td>136690</td></tr>
<tr><td>868</td><td>80210</td><td>T</td><td>100</td><td>342</td><td>136690</td></tr>
<tr><td>876</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136690</td></tr>
<tr><td>877</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136696</td></tr>
<tr><td>882</td><td>80211</td><td>T</td><td>100</td><td>442</td><td>136737</td></tr>
<tr><td>883</td><td>80213</td><td>T</td><td>2928</td><td>3370</td><td>136737</td></tr>
</table>

0

There are 0 answers