ETL choice, building an ETL that deals with SQL query engine (impala) or native database directly?

Question

ETL choice, building an ETL that deals with SQL query engine (impala) or native database directly?

196 views Asked by Atheer Abdullatif At 16 April 2021 at 22:52

I am trying to build an ETL that map the source tables to a dimensional, star schema model

our data warehouse is basically Impala on top of Kudu database

my question is, should I:

A- build an ETL that deals with kudu tables directly using Python (link)

or

B- or create UDFs (equivalent to stored procedures in SQL) in impala that does the insertion/joins etc to map source tables to star-schema model, and schedule it using Nifi or any scheduler such as Airflow etc

In my opinion, I think it would be better to deal with the native database rather than dealing with the SQL engine on top of it. but it is just an assumption.

Original Q&A

There are 1 answers

**Koushik Roy** · Accepted Answer · 2021-04-17T04:46:18+00:00

Why not approach C, :) a bit of both.

Both has pros and cons.

A - use python to build ETL - pros - better control, flexible to do any logic you want. cons - you have to code in python and code in sql. If something fails, it will be a nightmare to do RCA. Maintenance may be harder in comparison. - performance wise, this approach will be poorer in case of huge volume of data.
B - Use SQL to fetch data directly - pros - faster performance. less coding. cons - difficult to implement complex logic. Maintenance of code and schedule may be hard.

In addition to above, pls consider, your/teams comfort on python/SQL and future maintainability.
Currently we are using approach B in my cloudera project. We create views and then use insert to load final tables directly. We hardly need any UDF.
Now, my recommendation, please use approach B. And use approach A only in case you really can not create complex logic.

EDIT : Lets say, we have to load orders table. So we execute following blocks to load orders and dependent org,cust,prod tables.

Load customer   |
load org        | --> Load Orders final.
load product    |
load order stage|

Load customer block is collection of scripts like-

insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table

And similarly other blocks are written. Putting them in blocks gives us flexibility to put them in any order/anywhere we want to improve performance.

TechQA.

ETL choice, building an ETL that deals with SQL query engine (impala) or native database directly?

There are 1 answers

Related Questions in IMPALA

Related Questions in KUDU

Related Questions in APACHE-KUDU

Popular Questions

Popular Tags

Trending Questions