How to use dbt seed properly with dbt-spark[PyHive] running in EMR?

1k views Asked by At

Problem

I am trying to implement a new process using dbt seeds. When I use it in a Redshift connection there is no problem, but when I try to use it with dbt-spark[PyHive] in EMR some problems arise.

First Try

seed-paths: ["seeds"]

seeds:
      dock_analytics_spark:
        seed:
          location_root: s3://dock-analytics/seeds/

Using the code above in the dbt_project.yml I receive the following error:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"

Second Try

seed-paths: ["seeds"]

seeds:
  dock_analytics_spark:
    +materialized: table
    +schema: seed
    +file_format: parquet
    +location_root: s3://dock-analytics/seeds/

Using the code above in the dbt_project.yml I receive the following error:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'None' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 14, pos 6)

== SQL == /* {"app": "dbt", "dbt_version": "1.0.4", "profile_name": "dock_spark", "target_name": "dev", "node_id": "seed.dock_analytics_spark.seed_pos_eny_mde_pan_cde"} */

  create table dock_analytics_seed.seed_pos_eny_mde_pan_cde


using parquet




location 's3://dock-analytics/seeds//seed_pos_eny_mde_pan_cde'

as
  None

------^^^

Someone knows how to properly use dbt seed with dbt-spark[PyHive] running in EMR?

1

There are 1 answers

0
Kevin Li On

edit 1: The error was caused by the materialized config. If you're storing the seed into hdfs, you should not use it. Or you can use it but not with the location_root config, which means the seed will be stored locally.

Original: I'm not sure if you've figured this out or not. Just posting here for anyone else that running into this error. The AS statement needs a parameter from the profile to set the name for the seed table. This parameter should be specified explicitly like this:

seeds:
    tabel_name:
        +materialized: table
        +file_format: delta
        +location_root: 'hdfs://xxxxx'