I am using SQLAlchemy with the PyHive connector to execute SQL statements on a Spark cluster that is running the Hive Thrift server.
Most things are straightforward with native Spark and Delta tables. E.g. I can drop and create tables like
from sqlalchemy import create_engine
engine = create_engine('hive://hive-thriftserver-host:10000/default', echo=True)
engine.execute("DROP TABLE IF EXISTS baz")
engine.execute("CREATE TABLE baz (id int not null) USING DELTA")
engine.execute("DROP TABLE IF EXISTS baz")
engine.execute("CREATE TABLE baz (id int not null) USING DELTA")
So everything is OK and behaves as expected.
But with Iceberg tables things become complicated. It looks like DDL commands are merely scheduled for execution and performed asynchronously in the background, even inside a transaction: after DROP TABLE
the table is removed from the catalog instantly, but something prevents creating a table with the same name for some period of time (~30 seconds).
engine.execute("DROP TABLE IF EXISTS bear PURGE")
engine.execute("CREATE TABLE bear (id int not null) USING ICEBERG")
engine.execute("DROP TABLE IF EXISTS bear PURGE")
engine.execute("CREATE TABLE bear (id int not null) USING ICEBERG")
sqlalchemy.exc.OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(...
[TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `default`.`bear` because it already exists.
Choose a different name, drop or replace the existing object, or add the IF NOT EXISTS clause to tolerate pre-existing objects.'))
[SQL: CREATE TABLE bear (id int not null) USING ICEBERG]
After a while I can create this table successfully.
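To illustrate the delay, a crude retry loop like the sketch below eventually succeeds (the timeout and polling interval here are just my guesses based on the ~30 second window I observe):
import time
from sqlalchemy.exc import OperationalError

def create_with_retry(engine, ddl, timeout=30, interval=5):
    # Keep retrying the CREATE TABLE until whatever background cleanup is running finishes
    deadline = time.time() + timeout
    while True:
        try:
            engine.execute(ddl)
            return
        except OperationalError:
            if time.time() >= deadline:
                raise
            time.sleep(interval)

create_with_retry(engine, "CREATE TABLE bear (id int not null) USING ICEBERG")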
Could anyone please give ideas how to manage this and deal with such behavior? Again, checking for the table's existence with SHOW TABLES IN default
does not work, because tables are dropped from the catalog instantly. I searched for config options for table creation, but found nothing about sync/async behavior.
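For example, a check along these lines (a minimal sketch; the column layout of the SHOW TABLES result is my assumption) reports the table as gone immediately after the DROP, while the subsequent CREATE still fails:
def table_exists(engine, name, schema="default"):
    # Spark's SHOW TABLES returns rows like (namespace, tableName, isTemporary)
    rows = engine.execute(f"SHOW TABLES IN {schema}").fetchall()
    return any(name in row for row in rows)

engine.execute("DROP TABLE IF EXISTS bear PURGE")
print(table_exists(engine, "bear"))  # prints False right away, yet CREATE TABLE bear ... still fails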
There is also an issue that, even after DROP TABLE,
I cannot create a table with the same name using Spark or Delta. It looks like even after DROP TABLE bear PURGE
the table content remains in storage and prevents creating tables of other types.
engine.execute("DROP TABLE IF EXISTS bear PURGE")
engine.execute("CREATE TABLE bear (id int not null)")
sqlalchemy.exc.OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(...
[LOCATION_ALREADY_EXISTS] Cannot name the managed table as `spark_catalog`.`default`.`bear`, as its associated location 's3a://s3host/user/hive/warehouse/bear' already exists.
Please pick a different table name, or remove the existing location first."), operationHandle=None)
[SQL: CREATE TABLE bear (id int not null)]
So that doubles the strangeness of this async DDL behavior.