I am developing deployment via DBX to Azure Databricks. In this regard, I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do this with a Python file; there I would do the following:
build:
  python: "pip"

environments:
  default:
    workflows:
      - name: "workflow-name"
        schedule:
          quartz_cron_expression: "0 0 9 * * ?"  # every day at 9:00
          timezone_id: "Europe"
        format: MULTI_TASK
        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster
        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"
But how can I change it so that I can run a SQL job instead? I imagine it is the last two lines of code (spark_python_task: and python_file: "file://filename.py") that need to be changed.
There are various ways to do that.

(1) One of the simplest is to add a SQL query in the Databricks SQL workspace and then reference that query via sql_task, as described here; see the sketch below.
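A minimal sketch of such a task definition, assuming the query has already been saved in Databricks SQL and that the placeholder IDs are replaced with your own query ID and SQL warehouse ID (note that sql_task runs on a SQL warehouse rather than on a job cluster):

        tasks:
          - task_key: "task-name"
            sql_task:
              warehouse_id: "<your-sql-warehouse-id>"
              query:
                query_id: "<your-saved-query-id>"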
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python package and then call it from your package, e.g. as in the sketch below.
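A minimal sketch, assuming the package is called my_package (a placeholder), that data.sql is shipped inside it as package data (e.g. via package_data in setup.py), and that the file contains plain, semicolon-separated SQL statements:

    from importlib.resources import files  # Python 3.9+

    from pyspark.sql import SparkSession


    def run_sql_job() -> None:
        # Read the SQL text shipped inside the package
        # ("my_package" and the location of data.sql are assumptions).
        sql_text = files("my_package").joinpath("data.sql").read_text()

        spark = SparkSession.builder.getOrCreate()

        # Naive split on ";" -- fine for simple scripts, but it will break
        # if a statement contains a literal semicolon inside a string.
        for statement in sql_text.split(";"):
            if statement.strip():
                spark.sql(statement)


    if __name__ == "__main__":
        run_sql_job()

The entry point can then be wired into the workflow exactly as in your Python example, via spark_python_task (or python_wheel_task if the project is deployed as a wheel).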
(3) A third option is to use the dbt framework with Databricks. In this case you would probably want to use dbt_task, as described here; a sketch follows below.
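A minimal sketch of what a dbt_task could look like, assuming the dbt project lives in the repository root and that the dbt-databricks adapter is installed on the cluster via libraries (the project directory, commands, and package version are placeholders):

        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            dbt_task:
              project_directory: ""
              commands:
                - "dbt deps"
                - "dbt run"
            libraries:
              - pypi:
                  package: "dbt-databricks>=1.0.0"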