Problem
I am trying to implement a new process using dbt seeds. When I use it in a Redshift connection there is no problem, but when I try to use it with dbt-spark[PyHive] in EMR some problems arise.
First Try
seed-paths: ["seeds"]
seeds:
  dock_analytics_spark:
    seed:
      location_root: s3://dock-analytics/seeds/
Using the configuration above in dbt_project.yml, I receive the following error:
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Second Try
seed-paths: ["seeds"]
seeds:
  dock_analytics_spark:
    +materialized: table
    +schema: seed
    +file_format: parquet
    +location_root: s3://dock-analytics/seeds/
Using the configuration above in dbt_project.yml, I receive the following error:
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'None' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 14, pos 6)
== SQL == /* {"app": "dbt", "dbt_version": "1.0.4", "profile_name": "dock_spark", "target_name": "dev", "node_id": "seed.dock_analytics_spark.seed_pos_eny_mde_pan_cde"} */
create table dock_analytics_seed.seed_pos_eny_mde_pan_cde using parquet location 's3://dock-analytics/seeds//seed_pos_eny_mde_pan_cde' as None
------^^^
Does anyone know how to properly use dbt seed with dbt-spark[PyHive] running on EMR?
edit 1: The error was caused by the materialized config. If you're storing the seed in HDFS, you should not use it; or you can use it, but not together with the location_root config, in which case the seed will be stored locally.
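Following that edit, a sketch of the corrected dbt_project.yml (assuming the same project and bucket names as above) simply drops the +materialized override while keeping location_root:

```yaml
seed-paths: ["seeds"]

seeds:
  dock_analytics_spark:
    # No +materialized override: per the note above, combining it
    # with location_root produced the "as None" ParseException.
    +schema: seed
    +file_format: parquet
    +location_root: s3://dock-analytics/seeds/
```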
Original: I'm not sure if you've figured this out or not; just posting here for anyone else running into this error. The AS statement needs a parameter from the profile to set the name for the seed table. This parameter should be specified explicitly, like this: