I have to load millions of records into my DB, and these records need further processing based on the values of some of their columns. In particular, if a row satisfies a certain condition I save that row in table B, otherwise I save it in table C.
According to the documentation, sqlldr with DIRECT PATH is very fast; however, loading rows that way does not fire triggers. So I came up with two solutions:
SOLUTION 1:
Using SQL*Loader with DIRECT=true to load all the data into table A, then calling a stored procedure to perform the actual filtering. I am not entirely sure about this, but it seems that in this case Oracle can parallelize the work behind the scenes.
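Something like this is what I have in mind for the stored procedure (table names A/B/C are mine, the status_code column and the 'OK' condition are just placeholders, and I'm not sure the parallel hints are needed):

    -- Sketch only: column layouts of A, B and C are assumed to match.
    CREATE OR REPLACE PROCEDURE filter_loaded_rows AS
    BEGIN
      -- Ask Oracle to run the DML in parallel (if the instance supports parallel DML).
      EXECUTE IMMEDIATE 'ALTER SESSION ENABLE PARALLEL DML';

      -- Rows that satisfy the condition go to table B (direct-path, parallel insert).
      INSERT /*+ APPEND PARALLEL(4) */ INTO b
      SELECT * FROM a WHERE status_code = 'OK';
      COMMIT;  -- a direct-path/parallel insert must be committed before B is touched again

      -- Everything else goes to table C.
      INSERT /*+ APPEND PARALLEL(4) */ INTO c
      SELECT * FROM a WHERE status_code <> 'OK';
      COMMIT;
    END filter_loaded_rows;
    /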
SOLUTION 2:
Using SQL*Loader with DIRECT=false and an AFTER INSERT trigger on table A. In this case, for performance's sake, I would need to parallelize the load explicitly by splitting my data file into multiple files and calling sqlldr multiple times (by the way, I have no idea how to do that in a bash script...).
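For solution 2 the trigger would be something along these lines (again, column names are placeholders); note it fires once per inserted row:

    -- Sketch only: id, status_code and payload are placeholder columns.
    CREATE OR REPLACE TRIGGER trg_route_a_rows
    AFTER INSERT ON a
    FOR EACH ROW
    BEGIN
      IF :NEW.status_code = 'OK' THEN
        INSERT INTO b (id, status_code, payload)
        VALUES (:NEW.id, :NEW.status_code, :NEW.payload);
      ELSE
        INSERT INTO c (id, status_code, payload)
        VALUES (:NEW.id, :NEW.status_code, :NEW.payload);
      END IF;
    END;
    /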
Which one leads to better performance? And is there a big performance difference between the two?
In situations where data is loaded and then requires processing/cleansing before being added to the core data, my strong preference is to load it first into a staging table using the fastest available method, i.e. sqlldr with direct path. This gets the data loading done and dusted quickly. There are implications of direct path loads with regard to logging and recovery, so it's best to keep that activity to as small a window as possible.
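As a sketch of that first step, a control file along these lines (file, table and column names are placeholders) run with sqlldr control=load_staging.ctl would do the direct path load into the staging table:

    -- load_staging.ctl (hypothetical names)
    OPTIONS (DIRECT=TRUE)
    LOAD DATA
    INFILE 'records.dat'
    APPEND
    INTO TABLE staging_a
    FIELDS TERMINATED BY ','
    TRAILING NULLCOLS
    (id, status_code, payload)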
The staging data can then be processed with greater flexibility using stored procedures. Here you can split the records into batches, have the stored procedure process a particular batch via a parameter, and then run more than one stored procedure on discrete batches in parallel (assuming each record can be processed discretely). Add in things like a process status (READY, CHECKING, FAILED, COMPLETED) to control the processing and you have something flexible and more manageable. On the last project where I did this we loaded hundreds of millions of rows from files, and then processed them in batches over several nights during the quiet batch periods.
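A rough sketch of that batch-driven approach (all table and column names here are placeholders, not a definitive implementation):

    -- Sketch: assumes staging_a has id, batch_id, process_status and status_code columns.
    CREATE OR REPLACE PROCEDURE process_batch(p_batch_id IN NUMBER) AS
    BEGIN
      -- Claim the batch so parallel sessions running this procedure don't collide.
      UPDATE staging_a
      SET    process_status = 'CHECKING'
      WHERE  batch_id = p_batch_id
      AND    process_status = 'READY';
      COMMIT;

      FOR r IN (SELECT * FROM staging_a
                WHERE  batch_id = p_batch_id
                AND    process_status = 'CHECKING')
      LOOP
        BEGIN
          -- Route each row according to the business condition.
          IF r.status_code = 'OK' THEN
            INSERT INTO b (id, status_code, payload)
            VALUES (r.id, r.status_code, r.payload);
          ELSE
            INSERT INTO c (id, status_code, payload)
            VALUES (r.id, r.status_code, r.payload);
          END IF;

          UPDATE staging_a SET process_status = 'COMPLETED' WHERE id = r.id;
        EXCEPTION
          WHEN OTHERS THEN
            -- Mark the row so it can be inspected and reprocessed later.
            UPDATE staging_a SET process_status = 'FAILED' WHERE id = r.id;
        END;
      END LOOP;

      COMMIT;
    END process_batch;
    /

Each batch can then run in its own session or scheduler job, e.g. process_batch(1), process_batch(2), ... executing concurrently.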
Coding-wise it's probably more work, but messing around with files, bad records, reloads, and avoiding duplicates via sqlldr is more cumbersome than handling the same issues as table data.