Glue Catalog w/ Delta Tables Connected to Databricks SQL Engine


I am trying to query Delta tables from my AWS Glue Catalog with the Databricks SQL engine. The tables are stored in Delta Lake format, and Glue crawlers keep the schemas up to date. The catalog is set up and working for non-Delta tables. Databricks lists the available tables per database through the catalog, but queries fail because Databricks reads the tables with Hive instead of Delta.

Incompatible format detected.

A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.

To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html

SQL Warehouse settings => Data Access Configuration

spark.databricks.hive.metastore.glueCatalog.enabled : true
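
For reference, a minimal reproduction against the Glue-backed catalog looks roughly like this (database and table names are placeholders matching the S3 path above); listing works, and the SELECT is what triggers the error:

-- Listing tables through the Glue catalog works fine
SHOW TABLES IN club;

-- Querying the Delta-backed table fails with "Incompatible format detected"
SELECT * FROM club.attachment LIMIT 10;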

The crawler, configured with the Delta Lake setup from AWS, produces the following table metadata:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media_type",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "title",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "type",
                    "type": "smallint",
                    "comment": ""
                },
                {
                    "name": "clubmessage_id",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
        "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            "parameters": {}
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
            "CrawlerSchemaSerializerVersion": "1.0",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "classification": "parquet"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
        "CrawlerSchemaSerializerVersion": "1.0",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "classification": "parquet"
    }
}
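
A quick way to see how the SQL warehouse interprets that Glue entry is DESCRIBE TABLE EXTENDED; a hedged sketch, with placeholder names:

-- Inspect the table as Databricks resolves it through the Glue catalog.
-- With the crawler metadata above, the Provider/SerDe rows reflect the
-- Hive/Parquet symlink setup rather than Delta, which matches the error.
DESCRIBE TABLE EXTENDED club.attachment;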

There are 2 answers

Answer from Robert Kossendey

I am facing the same problem. It seems you cannot use Spark SQL to query a Delta table registered in Glue, because setting

spark.databricks.hive.metastore.glueCatalog.enabled : true

implies that the table will be treated as a Hive table. You would need to access the table in S3 directly (see the sketch after this answer), losing the advantages of the metadata catalog.

You can still read it through the catalog, though, by blocking your cluster's access to the _delta_log folder with the following IAM policy:

{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } } 
Answer from skud

I was able to query a Delta table created by Glue crawlers after updating its location. In your case it would need to be changed from s3://COMPANY/club/attachment/_symlink_format_manifest to s3://COMPANY/club/attachment.

This is because Delta on Spark does not use _symlink_format_manifest the way Hive and Presto do; it only needs to know the table's root directory.

The command in Databricks to update the location is something like this:

ALTER TABLE my_db.my_table
SET LOCATION 's3://COMPANY/club/attachment'

Note: your database location has to be set as well in order for that command to work.
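
A hedged sketch of setting both locations and verifying the result (ALTER DATABASE ... SET LOCATION is available in Spark 3.0+ / Databricks; the database name and parent path are assumptions):

-- Assumed parent path for the database; adjust to where its tables actually live.
ALTER DATABASE my_db SET LOCATION 's3://COMPANY/club';

-- Point the table at the Delta root instead of the _symlink_format_manifest directory.
ALTER TABLE my_db.my_table SET LOCATION 's3://COMPANY/club/attachment';

-- Verify what the catalog now reports.
DESCRIBE DATABASE EXTENDED my_db;
DESCRIBE TABLE EXTENDED my_db.my_table;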