ETL choice, building an ETL that deals with SQL query engine (impala) or native database directly?

148 views Asked by At

I am trying to build an ETL that map the source tables to a dimensional, star schema model

our data warehouse is basically Impala on top of Kudu database

my question is, should I:

A- build an ETL that deals with kudu tables directly using Python (link)

or

B- or create UDFs (equivalent to stored procedures in SQL) in impala that does the insertion/joins etc to map source tables to star-schema model, and schedule it using Nifi or any scheduler such as Airflow etc

In my opinion, I think it would be better to deal with the native database rather than dealing with the SQL engine on top of it. but it is just an assumption.

1

There are 1 answers

3
Koushik Roy On BEST ANSWER

Why not approach C, :) a bit of both.

Both has pros and cons.

  • A - use python to build ETL - pros - better control, flexible to do any logic you want. cons - you have to code in python and code in sql. If something fails, it will be a nightmare to do RCA. Maintenance may be harder in comparison. - performance wise, this approach will be poorer in case of huge volume of data.

  • B - Use SQL to fetch data directly - pros - faster performance. less coding. cons - difficult to implement complex logic. Maintenance of code and schedule may be hard.

In addition to above, pls consider, your/teams comfort on python/SQL and future maintainability.
Currently we are using approach B in my cloudera project. We create views and then use insert to load final tables directly. We hardly need any UDF.
Now, my recommendation, please use approach B. And use approach A only in case you really can not create complex logic.

EDIT : Lets say, we have to load orders table. So we execute following blocks to load orders and dependent org,cust,prod tables.

Load customer   |
load org        | --> Load Orders final.
load product    |
load order stage|

Load customer block is collection of scripts like-

insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table

And similarly other blocks are written. Putting them in blocks gives us flexibility to put them in any order/anywhere we want to improve performance.