I am trying to query delta tables from my AWS Glue Catalog on Databricks SQL Engine. They are stored in Delta Lake format. I have glue crawlers automating schemas. The catalog is setup & functioning with non Delta Tables. The setup via databricks loads the available tables per database via the catalog & but the query fails due to databricks using hive instead of delta to read.
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
SQL Warehouse settings => Data Access Configuration
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler using DELTA LAKE setup from AWS produces the following table metadata
{
"StorageDescriptor": {
"cols": {
"FieldSchema": [
{
"name": "id",
"type": "string",
"comment": ""
},
{
"name": "media",
"type": "string",
"comment": ""
},
{
"name": "media_type",
"type": "string",
"comment": ""
},
{
"name": "title",
"type": "string",
"comment": ""
},
{
"name": "type",
"type": "smallint",
"comment": ""
},
{
"name": "clubmessage_id",
"type": "string",
"comment": ""
}
]
},
"location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
"inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"compressed": "false",
"numBuckets": "-1",
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"parameters": {}
},
"bucketCols": [],
"sortCols": [],
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
},
"SkewedInfo": {},
"storedAsSubDirectories": "false"
},
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
}
}
I am facing the same problem. It seems like you can not use Spark SQL to query a delta table in Glue, because setting
implies that the table will be a hive table. You will need to access the table in S3 directly, losing the advantages of the meta data catalog.
You can read from it though, by blocking your cluster from accessing the _delta_log folder with the following IAM policy: