Why Iceberg tables created in Spark in EMR notebook do not show schema in AWS Glue and AWS Athena?

322 views Asked by At

By following the AWS document at Use an Iceberg cluster with Spark and iceberg doc at Iceberg AWS Integrations, my goal is to create the Iceberg tables in Spark in an EMR workspace/notebook, and query such tables in Athena by having them registered in the Glue catalog.

I have started the EMR 6.5.0 cluster with the following configuration, also mentioned in the same AWS document above:

[{
    "Classification":"iceberg-defaults",
    "Properties":{"iceberg.enabled":"true"}
}]

With the goal in mind, I have made the following Spark configurations to enable the use of the Glue catalog. These are also based in the AWS document and in the iceberg docs given above, for the latest and previous versions of iceberg.

%%configure -f
{
    "conf":{
        "spark.sql.catalog.my_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.my_catalog.warehouse": "s3://my-bucket/db",
        "spark.sql.catalog.my_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.my_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.my_catalog.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
        "spark.sql.catalog.my_catalog.lock.table": "myGlueLockTable"
    }
}

With the configured and running Spark Session, I create a database, and later a table within my_catalog.

spark.sql("""
CREATE DATABASE my_catalog.db_test;
""")


spark.sql("""
CREATE TABLE my_catalog.db_test.table_20231012 (
id bigint,
name string,
<other fields>
) USING iceberg;
""")

By default, the table location would be in my-warehouse-location/my-ns.db/my-table as mentioned in Iceberg AWS integration doc.

Prior to that I also included a LOCATION parameter in the query after the “USING iceberg”, like so:

LOCATION 's3://my-warehouse-location/table_20231012'

The results of these were the following:

  • The resulting table’s metadata is stored in the designated warehouse location. Without the LOCATION query parameter, it is at my-warehouse-location/my-ns.db/my-table .
  • The table is visible in Athena, in the given database, with the given table name. I can generate its CREATE TABLE/DDL script. However, its columns do not load (expanding the columns in the sidebar gets stuck with the loading animation) which also causes an error in previewing the table. The error message is COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns
  • Similarly, the database and table are visible in Glue, but without the schema columns. It records the table name and database name, the metadata_location and that table_type is ICEBERG but none on the Location, Input format, Output format, Serde serialization lib, etc.
  • My workaround is to manually add the table columns in the Schema section of the table in the Glue console, after which it starts to work like a normal table which I can query in Athena.

I expect the schema to reflect properly in Glue after table creation in the configured Spark app because none of the documentations describe any more steps than the ones I took. Please suggest what I may do or learn further for this scenario.

0

There are 0 answers