Can a Glue crawler crawl Delta Lake files to create tables in the AWS Glue Data Catalog?


We have an existing infrastructure in which AWS Glue crawlers crawl our S3 directories. These directories are part of our AWS data lake and are written by a Spark job. To add support for incremental (delta) updates, we are running a POC with Delta Lake. However, when I write Delta Lake files to S3 through our Spark Delta jobs, my crawlers are not able to create tables from these files.

Can we crawl Delta Lake files using AWS Glue crawlers?


There are 2 answers

1
Prabhakar Reddy On BEST ANSWER

As per the Delta Lake documentation on Presto and Athena integration, you should not use a Glue crawler on the table location. Instead, generate manifest files to integrate the Delta table with Athena.

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
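The manifest approach mentioned above can be sketched roughly as follows. This is a minimal sketch, not a definitive setup: it assumes the `delta-spark` package is available in the Spark job, and the table name, column list, and bucket path are hypothetical placeholders. `generate_manifest` asks Delta Lake to write a symlink manifest under the table path, and `symlink_manifest_ddl` builds the Athena DDL that points at that manifest instead of at the raw Parquet files.

```python
def generate_manifest(spark, delta_path):
    """Write a symlink manifest for the Delta table at delta_path.

    Assumes delta-spark is installed; the import is local so the DDL
    helper below can be used without Spark.
    """
    from delta.tables import DeltaTable

    DeltaTable.forPath(spark, delta_path).generate("symlink_format_manifest")


def symlink_manifest_ddl(table, columns, delta_path, partitions=None):
    """Build Athena DDL for a table that reads the Delta symlink manifest.

    columns and partitions are lists of (name, type) pairs; all names
    passed in here are illustrative, not taken from the original post.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    partition_clause = ""
    if partitions:
        pcols = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions)
        partition_clause = f"PARTITIONED BY ({pcols})\n"
    # The manifest lives in a fixed subdirectory of the Delta table path.
    location = delta_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"{partition_clause}"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and path, for illustration only.
    print(symlink_manifest_ddl(
        "events",
        [("id", "bigint"), ("payload", "string")],
        "s3://my-bucket/delta/events",
    ))
```

The manifest is a snapshot, so it needs to be regenerated after the Delta table changes. For a partitioned table, Athena also needs the partitions registered (for example with `MSCK REPAIR TABLE`) before queries return rows.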

0
Kyle Duong On

AWS Glue crawlers gained Delta Lake support in 2022. The crawler parses the Delta transaction log to find the latest snapshot of the Delta table, then creates manifest files and an entry in the Glue Data Catalog that can be queried from Athena or Redshift Spectrum. The table created by the Delta Lake crawler is also compatible with Lake Formation cell-level security.

When creating a Delta Lake crawler, make sure you specify a Delta target in the console rather than an S3 target. The crawler can be scheduled; it automatically detects schema evolution in your Delta Lake tables, updates the Glue Data Catalog accordingly, and adds any new partitions that it discovers.
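The same Delta target can also be configured outside the console. The following is a rough sketch using `boto3`, assuming AWS credentials are configured; the crawler name, role ARN, database name, S3 path, and cron expression are hypothetical placeholders. `delta_crawler_request` only builds the request, with a `DeltaTargets` entry instead of `S3Targets`, and `create_and_start` sends it to Glue.

```python
def delta_crawler_request(name, role_arn, database, delta_table_paths,
                          write_manifest=True, schedule=None):
    """Build create_crawler arguments for a crawler with a Delta target.

    All argument values are supplied by the caller; nothing here is
    taken from the original post.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # A Delta target, not an S3 target.
            "DeltaTargets": [
                {
                    "DeltaTables": list(delta_table_paths),
                    "WriteManifest": write_manifest,
                }
            ]
        },
    }
    if schedule is not None:
        # Glue expects a cron expression such as "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


def create_and_start(request):
    """Create the crawler and start one run (requires boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = delta_crawler_request(
        "delta-events-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "analytics",
        ["s3://my-bucket/delta/events/"],
        schedule="cron(0 2 * * ? *)",
    )
    print(req)
```

Keeping the request builder separate from the API call makes it easy to inspect or unit-test the configuration before anything is created in the account.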