How to use dbt seed properly with dbt-spark[PyHive] running in EMR?

Question

How to use dbt seed properly with dbt-spark[PyHive] running in EMR?

1k views Asked by Camila Lima At 12 April 2022 at 15:20

Problem

I am trying to implement a new process using dbt seeds. When I use it in a Redshift connection there is no problem, but when I try to use it with dbt-spark[PyHive] in EMR some problems arise.

First Try

seed-paths: ["seeds"]

seeds:
      dock_analytics_spark:
        seed:
          location_root: s3://dock-analytics/seeds/

Using the code above in the dbt_project.yml I receive the following error:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"

Second Try

seed-paths: ["seeds"]

seeds:
  dock_analytics_spark:
    +materialized: table
    +schema: seed
    +file_format: parquet
    +location_root: s3://dock-analytics/seeds/

Using the code above in the dbt_project.yml I receive the following error:

org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'None' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 14, pos 6)

== SQL == /* {"app": "dbt", "dbt_version": "1.0.4", "profile_name": "dock_spark", "target_name": "dev", "node_id": "seed.dock_analytics_spark.seed_pos_eny_mde_pan_cde"} */
  create table dock_analytics_seed.seed_pos_eny_mde_pan_cde


using parquet




location 's3://dock-analytics/seeds//seed_pos_eny_mde_pan_cde'

as
  None
------^^^

Someone knows how to properly use dbt seed with dbt-spark[PyHive] running in EMR?

Original Q&A

There are 1 answers

**Kevin Li** · Answer 1 · 2022-05-28T02:32:15+00:00

edit 1: The error was caused by the materialized config. If you're storing the seed into hdfs, you should not use it. Or you can use it but not with the location_root config, which means the seed will be stored locally.

Original: I'm not sure if you've figured this out or not. Just posting here for anyone else that running into this error. The AS statement needs a parameter from the profile to set the name for the seed table. This parameter should be specified explicitly like this:

seeds:
    tabel_name:
        +materialized: table
        +file_format: delta
        +location_root: 'hdfs://xxxxx'

TechQA.

How to use dbt seed properly with dbt-spark[PyHive] running in EMR?

Problem

First Try

Second Try

There are 1 answers

Related Questions in APACHE-SPARK

Related Questions in HIVE

Related Questions in DBT

Related Questions in PYHIVE

Popular Questions

Popular Tags

Trending Questions