Data filtering performance:sqlldr+triggers or sqlldr+stored procedure?

618 views Asked by At

I have to load millions of records on my DB, said records need further processing according to some of their columns values. In particular if a row satisfies a condition I save that row in table B, otherwise I save it in table C.

According to documentation sqlldr with DIRECT PATH is very fast, anyway Loading rows like that does not trigger triggers. So, I came up with two solutions:

SOLUTION 1:

Using sql loader with DIRECT PATH=true to load all data in table A. Then calling a stored procedure to perform the actual filtering. I am not so sure about that but it would seem that in this case oracle actually performs multithreading behind the scenes.

SOLUTION 2:

Using sql loader with DIRECT PATH=false with triggers activated after insert on table A.in this case, for performance sake, I would need to explicitely perform multithreading by splitting my data file in multiple files and calling sql loader multiple times (by the way I have no idea on how to do that on a bash script...)

Which one leads to better performances?And is there so much performance difference between the two?

2

There are 2 answers

2
TenG On BEST ANSWER

In situations where data is loading then requires processing/cleansing before adding to the core data my strong preference is to load it first into a staging table, and use the fastest available method, i.e. sqlldr with direct path. This gets the data loading done and dusted quickly. There are implications with direct paths with regards to logging and recovery, so best to keep that activity to a small a window as possible.

The staging data can then be processed with greater flexibility using stored procedures. Here you can split the records into batches and have the stored procedure process a particular batch via a parameter, and then have more than 1 stored proc running on discrete batches in parallel (assuming each record can be processed discretely). Add in things like process status (READY, CHECKING, FAILED, COMPLETED) to control the process and you have something flexible and more manageable. The last project I did where we had this we loaded 100s millions of rows from files, and then processed them in batches over several nights during the quiet batch periods.

Coding wise it's probably more work, but messing around with files and bad records and reload and avoiding duplicates via sqlldr is more cumbersome than as table data.

2
daivrz On

The only way you'll know about performance is to actually try the different approaches. For the first option, instead of using a procedure (which will possibly result in row by row processing) you could try straight SQL. See multi table inserts. This allows you to specify a condition to direct the inserts to different tables.

Edit

I've never actually written a multi table insert myself so I thought I'd try. It's quite neat.

  > CREATE TABLE t_stg
  (pk NUMBER PRIMARY KEY
  ,cond NUMBER
  ,text VARCHAR2(20))

  table T_STG created.

  > CREATE TABLE t_1 AS SELECT * FROM t_stg WHERE 1=2

  table T_1 created.

  > CREATE TABLE t_2 AS SELECT * FROM t_stg where 1=2

  table T_2 created.

  > INSERT INTO t_stg
  (SELECT LEVEL lvl 
        ,mod(LEVEL, 2) cond
        ,to_char(SYSDATE + LEVEL, 'Day') txt
  FROM dual
  CONNECT BY LEVEL < 8)

  7 rows inserted.

  > SELECT *
  FROM   t_stg

          PK       COND TEXT               
  ---------- ---------- --------------------
           1          1 Sunday               
           2          0 Monday               
           3          1 Tuesday              
           4          0 Wednesday            
           5          1 Thursday             
           6          0 Friday               
           7          1 Saturday             

   7 rows selected 

  > INSERT ALL
  WHEN cond = 1 THEN
     INTO t_1 (pk, cond, text) VALUES (pk*2, cond, text)
  ELSE
     INTO t_2 (pk, cond, text) VALUES (pk, cond, text)
  SELECT pk
        ,cond
        ,text
  FROM   t_stg

  7 rows inserted.

  > SELECT *
  FROM   t_1
          PK       COND TEXT               
  ---------- ---------- --------------------
           2          1 Sunday               
           6          1 Tuesday              
          10          1 Thursday             
          14          1 Saturday             

  > SELECT *
  FROM   t_2

          PK       COND TEXT               
  ---------- ---------- --------------------
           2          0 Monday               
           4          0 Wednesday            
           6          0 Friday