Executing SQL Statements in Parallel inside a Stored Procedure in Snowflake


In Snowflake, is there any option to execute SQL statements in parallel inside a stored procedure? I have a stored procedure (shown below) with 35 SQL statements that run sequentially. We are planning to reduce the total runtime by executing all of them in parallel.

What would be the best approach to achieve this? (All I can think of is creating 35 stored procedures and calling all of them from a scheduler at the same time.) Is there a better approach, or a Snowflake feature, that achieves this?

create or replace procedure SP_TEST()
  returns string
  language javascript
  execute as CALLER
as
$$
  try {
    var step_num = 0;

    step_num++; // --#1
    var sql_statement1 = snowflake.createStatement({sqlText: `INSERT INTO TGT_TBL SELECT * FROM TBL_A`});
    sql_statement1.execute();

    step_num++; // --#2
    var sql_statement2 = snowflake.createStatement({sqlText: `INSERT INTO TGT_TBL SELECT * FROM TBL_B`});
    sql_statement2.execute();

    return "Completed " + step_num + " steps to load into TGT_TBL";
  }
  catch (err) {
    throw(err);
  }
$$
;

There are 4 answers

Gokhan Atil (2 votes)

As an alternative to your idea of creating 35 tasks by hand, you can consider creating one task that calls a stored procedure, which in turn creates 35 tasks to run these individual queries. If the query SQL can be read from a table, or parts of it can be generated by JavaScript (such as table names with dates, e.g. CONNECT20200120), this can be a good automation.
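A minimal sketch of that pattern, assuming a hypothetical config table SQL_STEPS (columns TASK_NAME and SQL_TEXT), an existing warehouse MY_WH, and a role that has the account-level EXECUTE TASK privilege; the procedure creates one task per row and triggers an immediate asynchronous run of each, so the loads proceed in parallel:

create or replace procedure SP_CREATE_LOAD_TASKS()
  returns string
  language javascript
  execute as CALLER
as
$$
  // Read the statements to run from a config table (hypothetical name and columns).
  var rs = snowflake.createStatement(
    {sqlText: `SELECT TASK_NAME, SQL_TEXT FROM SQL_STEPS`}
  ).execute();

  var count = 0;
  while (rs.next()) {
    var taskName = rs.getColumnValue("TASK_NAME");
    var sqlText  = rs.getColumnValue("SQL_TEXT");

    // One task per statement; the config table is assumed trusted,
    // since its contents are concatenated into DDL.
    snowflake.createStatement({
      sqlText: "CREATE OR REPLACE TASK " + taskName +
               " WAREHOUSE = MY_WH AS " + sqlText
    }).execute();

    // Trigger a single, immediate, asynchronous run of the task.
    snowflake.createStatement({
      sqlText: "EXECUTE TASK " + taskName
    }).execute();

    count++;
  }
  return "Created and started " + count + " tasks";
$$
;

Because EXECUTE TASK is asynchronous, the procedure returns right away; the individual runs can then be monitored via the INFORMATION_SCHEMA.TASK_HISTORY table function.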

Marcel (0 votes)

Actually, it's sequential because the execute() method runs each statement to completion before control returns, so the next statement only starts after the previous one finishes.

To answer your question directly: achieving parallelization within a single procedure is not possible; as far as I know there is no feature or specific method for it.

But in general: increasing your warehouse size is a good way to save time, as the queries may run faster. It's not parallel, but it is faster.
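For example, scaling up is a one-line change (MY_WH is a placeholder warehouse name):

-- Scale up before the heavy load, then scale back down.
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'LARGE';
CALL SP_TEST();
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XSMALL';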

Scaling out (i.e. using a multi-cluster warehouse) doesn't provide benefits in my opinion when you use one single procedure, since the procedure only ever issues one query at a time.

Nadzeya (0 votes)

Split your stored procedure into several ones (the parts that should run in parallel) and use Snowflake dependent tasks so that the different parts run after the main part is done, as sketched below.
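A rough sketch of that setup, assuming a warehouse MY_WH and hypothetical procedures SP_MAIN, SP_PART_A, and SP_PART_B; both child tasks name the same predecessor in their AFTER clause, so once MAIN_TASK finishes they become runnable at the same time:

CREATE OR REPLACE TASK MAIN_TASK
  WAREHOUSE = MY_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL SP_MAIN();

-- Both child tasks depend only on MAIN_TASK, so they can start together.
CREATE OR REPLACE TASK PART_A_TASK
  WAREHOUSE = MY_WH
  AFTER MAIN_TASK
AS
  CALL SP_PART_A();

CREATE OR REPLACE TASK PART_B_TASK
  WAREHOUSE = MY_WH
  AFTER MAIN_TASK
AS
  CALL SP_PART_B();

-- Child tasks must be resumed before the root task.
ALTER TASK PART_A_TASK RESUME;
ALTER TASK PART_B_TASK RESUME;
ALTER TASK MAIN_TASK RESUME;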

Thierno (1 vote)

If your goal is to populate the table TGT_TBL with values from multiple tables [TBL_A, TBL_B, ..., TBL_35], I would recommend using a third-party tool such as dbt, Informatica Cloud (IICS), or any tool you prefer from this list: https://docs.snowflake.com/en/user-guide/ecosystem-etl.html.

In fact, Snowflake stored procedures and even tasks are limited, and they are difficult to monitor in full orchestration/integration pipelines.

Using IICS with Pushdown Optimization, or dbt, you can take an incremental-load approach and execute all 35 SQL statements simultaneously on Snowflake, while still leveraging the power of Snowflake virtual warehouses.
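A rough sketch of the dbt route (the file and column names here are hypothetical): give each source table its own model, and dbt will run independent models in parallel, up to the threads value configured in profiles.yml; a final model can then union the staged results into TGT_TBL.

-- models/stg_tbl_a.sql  (repeat per source table; independent
-- models run concurrently, governed by "threads" in profiles.yml)
{{ config(materialized='incremental') }}

select *
from TBL_A
{% if is_incremental() %}
  -- only pull rows newer than what is already loaded
  -- (assumes the source has an UPDATED_AT column)
  where UPDATED_AT > (select max(UPDATED_AT) from {{ this }})
{% endif %}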